Mass Spectrometry

ABSTRACT

The present invention is concerned with methods for the de novo sequencing of polypeptides from data obtained from mass spectrometry devices, particularly from (MS)^(n) devices.

The present invention is concerned with methods for the de novo sequencing of polypeptides from data obtained from mass spectrometry devices, particularly from (MS)^(n) devices.

It is currently well known in the art to use mass spectrometry to confirm the identity of a sample protein/polypeptide (the two terms are interchangeable herein unless stated otherwise). Protein mass fingerprinting programs such as MASCOT (based on the MOWSE algorithm) use mass spectrometry data generated from the enzymatic digestion (using e.g. Trypsin) of a protein to attempt to identify it from primary sequence databases (Matrix Science Ltd, GB; Perkins et al., Electrophoresis. 1999 December; 20(18):3551-67; PMID: 10612281). In the prior art approaches to identifying proteins from mass spectrometry data, the experimental data are peptide molecular weights (in the form of mass to charge ratios) from the digestion of a protein by an enzyme. Other approaches use tandem mass spectrometry data from one or more peptides (also known as MS/MS and MS²), an ion species of interest being selected and fragmented to give hierarchical product ion spectra. Still others combine mass data with amino acid sequence data.

Notably, these techniques do not actually derive a polypeptide sequence from the MS or MS/MS data and instead provide a score or probability by which the mass spectrometry data is compared to the (already known) sequences on databases, and an experimenter can select the preferred database sequence (i.e. the one with the best score or highest probability) as being a likely candidate for the protein being analysed.

However, these prior art methods are unable to directly use data produced by the most recently developed mass spectrometry methods, namely multiple tandem mass spectrometry ((MS)^(n), tandem in time/space) since it results in the generation of extremely large volumes of hierarchical product ion data which is too complex to be compared to the databases. Furthermore, the prior art methods are incapable of directly deriving actual sequence data from mass spectrometry data, particularly from the highly complex (MS)^(n) spectra. Current (MS)^(n) devices include MS/MS (tandem, i.e. n=2) mass spectrometers, as well as devices such as the Kratos Axima QIT TOF mass spectrometer.

Papayannopoulos, IA (“The interpretation of collision-induced dissociation tandem mass spectra of peptides”, Mass Spectrom. Rev., 1995, 14(1) 49-73) discusses the interpretation of MS/MS tandem mass spectra of peptides, and computer-program based solutions. However, the methodology taught in the paper is an extremely subjective one, is applicable only to small peptides (maximum size of about 20 amino acids), and provides no specific teaching of how a given set of data should be interpreted to give a candidate amino acid sequence(s). At Section “VII. Conclusions” it concludes that “Furthermore, continuing improvements in mass spectrometric instrumentation make it possible to acquire tandem CID mass spectral data of larger and larger peptides and small proteins; although at present such data appear to be of limited practical utility, . . . ” and that “ . . . although the correct interpretation of tandem CID mass spectra of peptides of unknown structure can sometimes be difficult . . . using MS/MS data to confirm the expected sequence of peptides, or to identify modifications in otherwise known sequences, is within the reach of anyone seriously interested in the application of tandem mass spectrometry . . . .”

In addition, there are a number of errors in the Papayannopoulos paper (above) meaning that, if followed, incorrect masses will be calculated for various daughter, displacement and neutral loss ions, and/or that m/z peaks will be incorrectly attributed to ion species in a spectrum.

Thus a person following the teaching of Papayannopoulos would not achieve the results of the present invention, as detailed below. It should also be noted that the teaching of Papayannopoulos is only concerned with MS/MS (i.e. MS²) spectra, and not MS^(n) spectra where n>2.

The present invention seeks to overcome the prior art disadvantages and provides for a significant and substantial advance in the field of protein sequencing.

According to the present invention there is provided a method for determining at least one putative (i.e. candidate) amino acid sequence for a sample polypeptide, said sample polypeptide being partially degraded, said method comprising the steps of:

(i) obtaining a soft ionisation mass spectrum of said partially degraded sample polypeptide giving a set of m/z peaks of ion species obtained from said partially degraded sample polypeptide;

(ii) with said set of m/z peaks obtained in step (i), determining a plurality of candidate m/z peak sets, each m/z peak in each candidate m/z peak set differing from its at least one neighbour by the mass of an amino acid;

(iii) with each candidate m/z peak set determined at step (ii), determining a sequence of mass differences between each m/z peak and its at least one neighbour and discarding any candidate m/z peak sets whose mass difference sequence in reverse order does not form at least part of the mass difference sequence of another of said candidate m/z peak sets;

(iv) identifying any Difference Sets in the remaining m/z peak sets;

(v) with the remaining candidate m/z peak sets, identifying and discarding any candidate m/z peak set which is a contiguous subsequence of another candidate m/z peak set; and

(vi) determining a putative amino acid sequence for each remaining candidate m/z peak set, each amino acid sequence being that of the amino acids which correspond to the mass differences between each m/z peak and its at least one neighbour, each putative amino acid sequence comprising an at least partial putative amino acid sequence of said sample polypeptide.

By the “at least one neighbour” of an m/z peak is meant the closest m/z value above and/or below the m/z peak value. So, for example, in a nominal set of m/z peaks having values 375, 300, 347, 372 and 331, the peak value 331 has two neighbours, namely 300 and 347.
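The neighbour definition can be illustrated with a minimal Python sketch. The function name `neighbours` is an assumption introduced here for illustration only and does not appear in the described method.

```python
def neighbours(peaks, value):
    """Return the closest peak below and above `value` (None where absent)."""
    ordered = sorted(peaks)
    i = ordered.index(value)
    below = ordered[i - 1] if i > 0 else None
    above = ordered[i + 1] if i + 1 < len(ordered) else None
    return below, above


# The nominal peak set used in the text above.
print(neighbours([375, 300, 347, 372, 331], 331))  # (300, 347)
```

The lightest and heaviest peaks have only one neighbour each, which is why the text speaks of "at least one" neighbour.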

The sample polypeptide mass can be at least 3000 Da, for example at least 4000, 5000, 6000, 7000, 8000, 9000, 10000 or 15000 Da. The partial degradation of the sample polypeptide can result in fragments having masses of up to e.g. 3000 or 4000 Da.

The soft ionisation mass spectrum can give at least 3 m/z peaks, for example at least 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50, 75 or 100 m/z peaks.

Each candidate m/z peak set can comprise at least 3 m/z peaks, for example at least 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40 or 50 m/z peaks.

The present invention allows the production of candidate amino acid sequences from mass spectra which, to date, have been considered uninterpretable. This is achieved by the use of reverse sequences at step (iii), which has not previously been suggested. The generation of all possible candidate mass peak sets which could correspond to m/z peaks obtained from a sample polypeptide also helps ensure that all possible candidate sequences are considered. Together with the use of inductive and/or deductive processes to identify Difference Sets, the present invention is able to achieve de novo sequencing of a sample polypeptide, a significant advantage over the prior art.

Soft ionisation techniques are well known in the art and, generally speaking, result in minimal fragmentation of the sample being ionised, and typically work well with polar and thermally labile compounds. Examples of soft ionisation techniques include (but are not limited to) matrix-assisted laser desorption and ionisation (MALDI), electrospray ionisation (ESI), atmospheric pressure chemical ionisation (APCI), fast atom bombardment (FAB) and chemical ionisation (CI). In particular, MALDI-time of flight (MALDI-TOF) mass spectrometry can be used in the present invention.

Examples of matrix molecules which can be used with MALDI include: 2-amino-4-methyl-5-nitropyridine, 2-amino-5-nitropyridine, 6-aza-2-thiothymine, caffeic acid, α-cyano-4-hydroxy cinnamic acid (CHCA), 2,5-dihydroxybenzoic acid (gentisic acid, DHB), 2,5-dihydroxybenzoic acid and fucose (1:1), ferulic acid, glycerol with rhodamine 6G (0.1M), 2-(4-hydroxyphenylazo)benzoic acid (HABA), 3-hydroxypicolinic acid (HPA), nicotinic acid, 3-nitrobenzyl alcohol, 3-nitrobenzyl alcohol with rhodamine 6G, 3-nitrobenzyl alcohol with 1,4-diphenyl-1,3-butadiene (0.1M), 2-pyrazinecarboxylic acid, 3,5-dimethoxy-4-hydroxy cinnamic acid (sinapinic acid, SA), and succinic acid, and a skilled person will be readily able to determine exactly which matrix molecules can be used with a given sample polypeptide. Other matrix molecules which may be used include 5-chloro-2-hydroxybenzoic acid (5-chlorosalicylic acid, CSA), trans-indole-acrylic acid (IAA), 5-methoxysalicylic acid (5-methoxy-2-hydroxybenzoic acid, MSA), norharmane (9H-pyrido[3,4-b]indole, nH), picolinic acid (2-pyridinecarboxylic acid, PA), 2,4,6-trihydroxy-acetophenone (THAP), 1,8,9-trihydroxy-anthracene (dithranol), and cobalt ultra fine metal powder.

Although peptide ions typically fragment at the peptide backbone to produce series of fragment ions (for example as a result of internal energy derived in the ionisation process), this can be supplemented by e.g. colliding the sample polypeptide ions with neutral gas molecules (i.e. by employing collision-induced decomposition in electrospray mass spectrometry). In MALDI spectrometry, the fragmentation of a protein/peptide into ions can result not only from initial ionisation, but also from other events such as post source decay (PSD), which can occur along the trajectory of ions through the time of flight (TOF) mass spectrometer.

The ionisation of proteins results in the generation of a number of ion species as a result of the structure of the protein (see FIG. 1). For example, the cleavage of a CH-CO bond results in the generation of a- and x-daughter ions. The a-ions are N-terminal fragments and the x-ions are C-terminal fragments. Similarly, the cleavage of a backbone CO-NH bond results in the generation of N-terminal b- and C-terminal y-daughter ions. The cleavage of an NH-CH bond results in the generation of N-terminal c- and C-terminal z-daughter ions. The most frequently generated ion species are the b- and y-daughter ions together with a-daughter ions.

It is also possible for End Ion Clusters to be generated from each of the different daughter ion species. Each of the N-terminal ion species (i.e. the a-, b- and c-daughter ions) can produce hybrid ions at a mass of the daughter ion minus the mass of the terminal group (16 Daltons).

In addition, a-ions can be protonated, giving a hybrid peak at the daughter ion mass +1 Dalton. Other hybrid peaks can also arise at the daughter ion mass −2 Daltons and +1 Dalton corresponding, respectively, to y_(n)−2 and z_(n)+1 ions.

In an example sequence NH₂-CHR₁-CO-NH-CHR₂-CO-NH-CHR₃-COOH an abc cluster (i.e. a c-ion is 17 Daltons heavier than a b-ion, which in turn is 28 Daltons heavier than an a-ion) can be generated by each CHR-CO-NH, as can a corresponding xyz cluster. Ions and clusters of ions from adjacent amino acids in the protein are separated by the mass of the amino acid.

These various daughter ions and end ion clusters are generally referred to herein as defining “Difference Ions” and “Difference Sets” and they can be identified in the candidate m/z peak sets in a number of ways. Difference Sets include Displacement Sets, which can identify ion series (e.g. a-, b-, c-, x-, y-, and z-), and Neutral Loss Sets, which can provide information about amino acids.

The present invention, with its various filtering and identification steps, is able to identify all the potential amino acid sequences which could result in a given set of m/z peaks, and remove those amino acid sequences which are incorrect or which are contiguous subsequences of a larger set (sequence), to produce a final result of at least one putative amino acid sequence. For some amino acid sequences, the final result may just be a single putative amino acid sequence, whereas for other (particularly, larger) amino acid sequences it may produce more than one putative amino acid sequence, corresponding to fragments of the original sample polypeptide, or which may result from there being more than one possible solution to a particular set of m/z peaks.

Importantly, the present invention creates de novo putative amino acid sequences, and does not rely upon the sample polypeptide and its corresponding set of m/z peaks being in a database and being correlated with an m/z peak set generated by mass spectrometry of the sample polypeptide.

The sample polypeptide can be partially degraded using an enzyme selected from the group consisting of exopeptidases and endopeptidases. These are the two basic sub-classes of proteases, exopeptidases causing a proteolytic digestion of polypeptides from their N-/C-termini inwards, and endopeptidases causing a proteolytic digestion of polypeptides at specific amino acid sequences within the polypeptide.

Endopeptidases can be particularly useful since they act to provide additional information about the sample polypeptide when the at least one putative amino acid sequence is created or reviewed, and in particular can be used to confirm that a specific sequence or sequences must be present in the sample polypeptide.

For example, the endopeptidase trypsin can be used in the present invention. Other useful endopeptidases are well known and will be readily apparent to the skilled person.

In the method of the invention, the plurality of candidate m/z peak sets can be identified at step (ii) by:

(a) from the set of m/z peaks obtained in step (i) having x peaks, identifying every possible candidate m/z peak set having between 2 and x members; and

(b) discarding every candidate m/z peak set having a mass difference between any one of its m/z peaks and its at least one neighbour which is not equal to the mass of an amino acid.

For example, at step (b), it is possible to determine in each candidate m/z peak set the mass difference between each m/z peak and its at least one neighbour, and to then discard every candidate m/z peak set having any mass difference which is not equal to the mass of an amino acid.
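Steps (a) and (b) can be sketched in Python as follows. This is an illustrative sketch only: the function names, the matching tolerance `tol` and the deliberately truncated table of monoisotopic residue masses are assumptions introduced here, not details taken from the described method.

```python
from itertools import combinations

# Assumption: monoisotopic residue masses (Da) for a handful of standard
# amino acids. The table is truncated on purpose to keep the example small.
RESIDUE_MASSES = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
                  "P": 97.05276, "V": 99.06841}


def is_residue_mass(diff, tol=0.01):
    """True if `diff` matches any tabulated residue mass within `tol`."""
    return any(abs(diff - m) <= tol for m in RESIDUE_MASSES.values())


def candidate_peak_sets(peaks, tol=0.01):
    """Step (a): enumerate every subset with 2..x members.
    Step (b): keep only subsets whose neighbouring peaks all differ
    by the mass of an amino acid."""
    ordered = sorted(peaks)
    kept = []
    for size in range(2, len(ordered) + 1):
        for subset in combinations(ordered, size):
            diffs = [hi - lo for lo, hi in zip(subset, subset[1:])]
            if all(is_residue_mass(d, tol) for d in diffs):
                kept.append(subset)
    return kept


peaks = [100.0, 157.02146, 228.05857, 300.0]
for candidate in candidate_peak_sets(peaks):
    print(candidate)
```

Exhaustive enumeration grows as 2^x with the number of peaks, so a practical implementation would probably extend chains peak by peak rather than test every subset; the brute-force form is used here only because it mirrors steps (a) and (b) directly.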

Thus a choice can be made between identifying only a certain number of possible candidate m/z peak sets, and determining all possible candidate m/z peak sets. The latter is obviously the more thorough of the two options, and particularly in the situation where nothing is known about the sample polypeptide it is typically the preferred option. If, however, anything is known about the sample polypeptide, such as a terminal amino acid sequence, then it is possible to use only those sets of candidate m/z peaks conforming with the known information about the polypeptide.

Such filtering can, of course, be applied at any convenient stage in the process of determining the at least one putative amino acid sequence, and issues such as programming simplicity, flexibility, and/or resource usage may be taken into consideration when determining the point or points at which any filtering is to be done.

The generating of sets of candidate m/z peaks can also be affected by the amino acid masses with which the mass differences are compared. For example, the set of amino acid masses may simply consist of the masses of standard amino acids. Alternatively, if it is known that a sample polypeptide does not comprise a certain amino acid then the mass of that certain amino acid can be excluded. Similarly, amino acid masses of e.g. chemically and/or post-translationally modified amino acids can also be used. The masses of other amino acids, both naturally occurring and synthetic, can also be used, and can include modified and unusual amino acids such as 2-aminoadipic acid, 2-aminobutyric acid, isodesmosine, 6-n-methyllysine, and norvaline. Others are listed in, for example, Table 4 of WIPO Standard St.23. Similarly, allowance can be made for isotopically labelled amino acids.

Obviously, if a given amino acid is known not to be in the sample polypeptide then, as an alternative to excluding its mass from the set of amino acid masses available for determining candidate m/z peak sets, the candidate m/z peak sets can be generated anyway and any which include the given amino acid can be discarded. This is, however, more computationally complex than excluding its mass from the set of amino acid masses available for determining candidate m/z peak sets and so may not be a preferred option.

The filtering process employed at step (iii) is referred to as the “Reflective Predicate Filter” and is not suggested anywhere in the art. It is particularly useful in that it correlates sets of daughter ions generated by the mass spectrometry: if an at least partial reverse-order mass difference sequence does not exist for a mass difference sequence determined for a candidate set of m/z peaks then the candidate m/z peak set is discarded. This step essentially removes from the sets of candidate m/z peaks (i.e. candidate amino acid sequences) those which include inappropriate daughter ions. For example, a set of m/z peaks each separated by the mass of an amino acid might include a number of b-daughter ions, together with an x-daughter ion. If the set of peaks consisted solely of b-daughter ion peaks then a complementary set of y-daughter ion peaks separated by the same mass differences should exist and therefore the candidate sequence would not be discarded. However, in the case of a set of b-daughter ions and an x-daughter ion, the complementary set of mass peaks would have to consist of a set of y-daughter ions and an a-daughter ion, and the mass differences would not correspond to those of amino acids or would be otherwise incorrect, and thus the candidate sequence having the x-daughter ion would be discarded. In the present context, “at least part” means at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20 or 25 mass differences of the mass difference sequence of another candidate m/z peak set. Of the two compared sequences, one must be a contiguous sub-sequence of the other.
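A minimal Python sketch of the Reflective Predicate Filter is given below. It assumes that each candidate m/z peak set has already been reduced to its sequence of mass differences, represented here by residue letters for readability. The names `contains_run` and `reflective_filter` and the parameter `min_overlap` are hypothetical and introduced only for this illustration.

```python
def contains_run(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    n = len(shorter)
    return any(tuple(longer[i:i + n]) == tuple(shorter)
               for i in range(len(longer) - n + 1))


def reflective_filter(diff_seqs, min_overlap=2):
    """Keep a mass difference sequence only if its reverse and the sequence
    of some other candidate are related as contiguous sub-sequence and
    super-sequence, sharing at least `min_overlap` mass differences."""
    kept = []
    for i, seq in enumerate(diff_seqs):
        rev = tuple(reversed(seq))
        for j, other in enumerate(diff_seqs):
            if i == j:
                continue  # a candidate cannot vouch for itself
            shorter, longer = sorted((rev, tuple(other)), key=len)
            if len(shorter) >= min_overlap and contains_run(longer, shorter):
                kept.append(tuple(seq))
                break
    return kept


candidates = [("G", "A", "S"), ("S", "A", "G", "V"), ("P", "V")]
print(reflective_filter(candidates))
```

In this example the first two candidates support one another, since the reverse of one is contained in the other, while `("P", "V")` has no reverse-order counterpart and is discarded.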

At step (iv), Difference Sets can be identified in a number of ways, which can be broadly split up into deductive and inductive processes. The deductive processes use logical rules to determine how the m/z peak sets should be interpreted. A first deductive process for determining Difference Sets comprises the steps of:

(a) comparing each of the remaining candidate m/z peak sets;

(b) correlating the results of comparison step (a) to identify any Difference Sets comprising a first of said candidate m/z peak sets which forms at least part of a second of said candidate m/z peak sets displaced by any one of the group consisting of −17 u, −18 u, −34 u and −48 u; and

(c) labelling the members of any −17 u Difference Sets as putatively containing an amino acid selected from the group consisting of: Asparagine, Glutamine, Lysine and Arginine, labelling the members of any −18 u Difference Sets as putatively containing an amino acid selected from the group consisting of: Serine, Threonine, Glutamic acid and Tyrosine, labelling the members of any −34 u Difference Sets as putatively containing a Cysteine amino acid, labelling the members of any −48 u Difference Sets as putatively containing a Methionine amino acid, and labelling the lighter member of each Difference Set as being a neutral loss candidate m/z peak set.

When labelling such Difference Sets, both the first and second candidate m/z peak sets may be labelled regarding their putatively containing the above amino acids.
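The first deductive process can be sketched as follows. The displacement values and amino acid labels are those listed in step (c) above; the function name `find_neutral_loss`, the nominal-mass tolerance `tol` and the requirement that every peak of the lighter set be matched are assumptions made for this illustration.

```python
# Displacements (in u) and the amino acids they putatively indicate,
# as listed in step (c) of the text.
NEUTRAL_LOSS_LABELS = {
    -17: ("Asn", "Gln", "Lys", "Arg"),
    -18: ("Ser", "Thr", "Glu", "Tyr"),
    -34: ("Cys",),
    -48: ("Met",),
}


def find_neutral_loss(lighter, heavier, tol=0.5):
    """Return the displacement relating the two peak sets, or None.

    The lighter set is accepted as a neutral loss set when each of its
    peaks, shifted back by the displacement, coincides with a peak of
    the heavier set."""
    if not lighter:
        return None
    for delta in NEUTRAL_LOSS_LABELS:
        if all(any(abs((p - delta) - h) <= tol for h in heavier)
               for p in lighter):
            return delta
    return None


heavier = [300.0, 387.0, 500.0]
lighter = [369.0, 482.0]
delta = find_neutral_loss(lighter, heavier)
print(delta, NEUTRAL_LOSS_LABELS.get(delta))
```

The same comparison, applied with the displacements +28 u, +17 u and −26 u, would serve for the ion series labelling described next.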

Alternatively or additionally, Difference Sets can be determined by:

(a) comparing each of the remaining candidate m/z peak sets;

(b) correlating the results of comparison step (a) to identify any Difference Sets comprising a first of said candidate m/z peak sets which forms at least part of a second of said candidate m/z peak sets displaced by any one of the group consisting of +28 u, +17 u and −26 u; and

(c) labelling the heavier and lighter of any +28 u Difference Sets respectively as putative b- and a-Difference Sets, labelling the heavier and lighter of any −26 u Difference Sets respectively as putative x- and y-Difference Sets, and labelling the heavier and lighter of any +17 u Difference Sets respectively as putative c- and b-Difference Sets.

The amino acid sequences represented by the candidate m/z peak sets can be determined by a simple look-up of the mass differences between the m/z peak values against a set of amino acid masses. Thus a set of m/z peak values can be readily translated into an amino acid sequence. This can be done starting from the heaviest m/z peak value or from the lightest or in any other order. However, the resultant sequence still needs to be assigned a directionality: the ion species from which the sequences are determined could be a-, b-, or c-ions which will give, going from the heaviest m/z peak value to the lightest, an amino acid sequence in the direction C to N. Alternatively, the ion species from which the sequences are determined could be x-, y-, or z-ions which will give, going from the heaviest m/z peak value to the lightest, an amino acid sequence in the direction N to C. Amino acid sequences are commonly represented in the N to C direction and sequence derived from a-, b- and c-ion species must not be confused with sequence derived from x-, y- and z-ion species.
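The look-up and the directionality rule can be sketched as below. The residue mass table is truncated and, like the tolerance `tol` and the function names, is an assumption made for illustration. The series label is taken as an input here, since determining it is a separate step.

```python
# Assumption: a truncated table of monoisotopic residue masses (Da).
RESIDUE_MASSES = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
                  "P": 97.05276, "V": 99.06841}


def residue_for(diff, tol=0.01):
    """Look up the residue whose mass matches `diff`, or '?' if none does."""
    for name, mass in RESIDUE_MASSES.items():
        if abs(diff - mass) <= tol:
            return name
    return "?"


def sequence_n_to_c(peaks, series, tol=0.01):
    """Translate a candidate peak set into a sequence written N to C.

    Reading from lightest to heaviest gives N to C for a-, b- and c-series
    and C to N for x-, y- and z-series, so the latter are reversed."""
    ordered = sorted(peaks)
    residues = [residue_for(hi - lo, tol)
                for lo, hi in zip(ordered, ordered[1:])]
    if series in ("x", "y", "z"):
        residues.reverse()
    return "".join(residues)


peaks = [100.0, 157.02146, 228.05857]
print(sequence_n_to_c(peaks, "b"))  # GA
print(sequence_n_to_c(peaks, "y"))  # AG
```

The same peak set therefore yields two different sequences depending on the series assigned to it, which is why the directionality must be settled before a putative sequence is reported.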

Thus a step in the process of determining a putative amino acid sequence can be to determine the directionality of the amino acid sequences represented by the candidate m/z peak sets, and this can be done as discussed by Papayannopoulos, IA (supra). By determining the precursor mass for the sample polypeptide and the known candidate m/z peak set values, the heaviest values in each of the sets can be compared with the precursor ion mass to identify any differences which correlate to the mass of an amino acid, or to the mass of an amino acid +18 u. If a candidate m/z peak set is found whose greatest m/z peak value is equal to the sample polypeptide precursor mass minus the mass of an amino acid then the candidate m/z peak set is a y-series whose C-terminal member is the amino acid corresponding to the mass difference. Alternatively, if a candidate m/z peak set is found whose greatest m/z peak value is equal to the sample polypeptide precursor mass minus 18 and minus the mass of an amino acid then the candidate m/z peak set is a b-series whose N-terminal member is the amino acid corresponding to the mass difference plus 18.

The step of determining the directionality can be referred to as being the Classification Predicate. Once an m/z peak set has been identified as being a-, b-, c-, x-, y- or z-series (but particularly a b- or y-series) (and thus its directionality determined), Difference Sets from the series can then be identified, and can be used for scoring (below).

Thus the method of the present invention may additionally comprise the step of determining the precursor mass for the sample polypeptide. The precursor mass may also be useful for reasons other than the identification of Difference Sets and so may in any event be determined.

Thus alternatively or additionally, Difference Sets can be determined by:

(a) with the remaining candidate m/z peak sets, calculating the difference between the heaviest m/z value in each candidate m/z peak set and said precursor mass of said sample polypeptide;

(b) comparing the differences with the masses of amino acids and with the masses of amino acids +18 u; and

(c) correlating the results of comparison step (b), a difference equal to the mass of an amino acid indicating a y-series Difference Set having the amino acid at its C-terminus, and a difference equal to the mass of an amino acid +18 u indicating a b-series Difference Set having the amino acid at its N-terminus.

Thus the direction of a candidate m/z peak set can be determined, and thus so can the direction of a putative amino acid sequence derived from the candidate m/z peak set.
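Steps (a) to (c) can be sketched as below. The text specifies an offset of 18 u; the monoisotopic value used for `WATER`, the truncated residue table, the tolerance and the function names are all assumptions made for this example.

```python
WATER = 18.01056  # assumption: monoisotopic mass standing in for "18 u"

# Assumption: a truncated table of monoisotopic residue masses (Da).
RESIDUE_MASSES = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
                  "P": 97.05276, "V": 99.06841}


def match_residue(mass, tol=0.01):
    """Return the residue matching `mass` within `tol`, or None."""
    for name, residue_mass in RESIDUE_MASSES.items():
        if abs(mass - residue_mass) <= tol:
            return name
    return None


def classify_series(peaks, precursor_mass, tol=0.01):
    """Compare the heaviest peak with the precursor mass.

    A gap equal to a residue mass suggests a y-series; a gap equal to a
    residue mass plus 18 u suggests a b-series. Returns a pair
    (series, residue) or None when neither test succeeds."""
    gap = precursor_mass - max(peaks)
    residue = match_residue(gap, tol)
    if residue is not None:
        return "y", residue
    residue = match_residue(gap - WATER, tol)
    if residue is not None:
        return "b", residue
    return None


precursor = 500.0
print(classify_series([300.0, precursor - 57.02146], precursor))
print(classify_series([300.0, precursor - WATER - 71.03711], precursor))
```

A result of `None` simply means this test could not classify the set; the displacement-based processes described earlier may still do so.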

In particular, the combination of the above three deductive processes to identify Difference Sets can provide valuable information allowing the determination of putative amino acid sequences for the sample polypeptide.

As well as acting as a filter to simplify the candidate m/z peak sets, the identification of Difference Sets can also be used as the basis of a scoring system in order to allow the allocation of a score to each remaining candidate m/z peak set (i.e. to each putative amino acid sequence). Thus when the results of the method of the present invention are determined, they can be provided together with their scores.

For the various series, and particularly for b- and y-series, scores can be calculated by counting the number of Displacement m/z values (i.e. Displacement Masses) for each series, and comparing this with the number of m/z values in the respective series (as appropriate), e.g. b- and y-series. The greater the number of possible Displacement m/z values found for the number of m/z values in the series (e.g. b- or y-series), the greater the likelihood of the sequence being correct, and therefore a score can be allocated as appropriate.

For example, scores can be calculated by counting the number of Displacement m/z values (Displacement Masses) for each Displacement Series obtainable from each main series, and dividing this by the number of m/z values in the main series, giving a value of <=1 for each Displacement Series. Thus for example with a main b-series having 5 masses, and a b-18 series having 3 masses, a score of ⅗ (0.6) is given.
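The ratio described above can be sketched as follows. The function name `displacement_score`, the nominal-mass tolerance and the example peak values are assumptions chosen to reproduce the 3-out-of-5 case in the text.

```python
def displacement_score(main_series, observed_peaks, displacement, tol=0.5):
    """Fraction of main-series peaks for which a peak displaced by
    `displacement` is present among the observed peaks."""
    hits = sum(
        any(abs((m + displacement) - p) <= tol for p in observed_peaks)
        for m in main_series
    )
    return hits / len(main_series)


# Hypothetical main b-series of 5 masses; three b-18 partners are observed.
b_series = [200.0, 300.0, 400.0, 500.0, 600.0]
observed = [182.0, 282.0, 450.0, 582.0]
print(displacement_score(b_series, observed, -18))  # 0.6
```

One such value would be obtained for each Displacement Series of a main series; how the values are combined into a single score is left open here.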

In the case of b-series, a-series Displacement Masses can also be included in the b-series score. Thus for a b-series, Displacement Series members of b-17, b-18, a, a-17 and a-18 can be included, corresponding to Displacement Masses of −17, −18, −28, −45 and −46 respectively.

In the case of y-series, only y-series Displacement Masses are included in the y-series score. Thus for a y-series, Displacement Series members of y-17 and y-18 are included, corresponding to Displacement Masses of −17 and −18.

This is further shown in Table 7.

Thus a score can be allotted to each remaining candidate m/z peak set comprising a main a-, b-, c-, x-, y- or z-series, said score being calculated by:

(a) determining the number of Displacement m/z values in each displacement series obtainable from said main series;

(b) correlating the results of step (a) with the number of m/z values in said main series; and

(c) allotting a score determined from the result of correlation step (b) to said main series.

In particular, this can be done with a main b- or y-series.

For b-series, an additional scoring factor, based on classifying the highest mass in a series as either the highest mass or the second highest mass in a b-series, can be calculated. For y-series, the highest mass is classified against only a y-series second highest mass, as the highest mass in a y-series is the same as that of the protonated precursor ion mass. If the highest mass in a y-series does not meet this criterion then an attempt to classify the lowest mass in the series as a y1 ion is made. If either of the criteria is met for a b- or y-series then the score can be incremented, for example by 1.0, giving a composite score. The composite scoring process is shown diagrammatically in FIG. 6.

As mentioned above, the present invention can be used with the latest (MS)^(n) mass spectrometers, where n is at least 2, for example 3, 4 or 5, a (MS)^(n) spectrum (or (MS)^(n) spectra) constituting that obtained at step (i). With regard to (MS)^(n) data which is generated for a sample polypeptide, the data consists of a mass spectrum for the sample polypeptide and at least one set of precursor ion mass spectra, each one of these being determined for a selected precursor ion. Candidate m/z peak sets can thus be determined for each precursor ion mass spectrum and they can then all be combined to give the plurality (i.e. the group) of candidate m/z peak sets of step (iii). Alternatively, the candidate m/z peak sets for the precursor mass spectra can be added to their respective precursor parent ions to define extended mass spectra, from each of which can then be determined a plurality of candidate m/z peak sets.

Specifically, in the case of using MS^(n) spectra where n>2 to generate extended mass spectra, each MS^(n) spectrum being generated from a precursor ion selected from a parent MS^(n−1) spectrum, the precursor ion having an m/z value, any peak in the parent MS^(n−1) spectrum having an m/z value less than that of the precursor ion is removed from the parent MS^(n−1) spectrum, and the MS^(n) spectrum is added to the parent MS^(n−1) spectrum to generate a hybrid MS^(n).MS^(n−1) spectrum. Thus if a parent MS² spectrum has 3 precursor ions which are each used to generate an MS³ spectrum then each MS³ spectrum can be used to generate a hybrid MS³.MS² spectrum. Thus a total of four spectra can be analysed: the MS² spectrum and the three hybrid MS³.MS² spectra.

If a precursor ion is then selected from one of the MS³ spectra and further ionised and used to generate an MS⁴ spectrum then this can be used to generate a hybrid MS⁴.MS³ spectrum as above (i.e., in this case n=4, and any peak in the parent MS^(n−1) spectrum having an m/z value less than that of the precursor ion is removed from the parent MS^(n−1) spectrum, and the MS^(n) spectrum is added to the parent MS^(n−1) spectrum to generate a hybrid MS^(n).MS^(n−1) spectrum). This hybrid MS⁴.MS³ spectrum is then added to the other spectra which share a common lowest MS^(n) value (in this case n=3), and they in turn are used to generate hybrid spectra with the at least one spectrum having an MS^(n) value one less than their common lowest MS^(n) value (i.e. in this case, with the single MS² spectrum), giving a total of five spectra to be analysed: the MS² spectrum, the three hybrid MS³.MS² spectra, and the single hybrid MS⁴.MS³.MS² spectrum.

Obviously, this methodology can be extended to any value of n where n>2, for example where n=5, 6, 7, 8, 9 or 10. No fundamental limitations are placed upon the system other than the availability of ion species to act as precursor ions. In practical terms, the methodology does have the potential to provide for a substantial increase in the number of spectra to be analysed and therefore the volume of data to be analysed. However, the various filter predicates of the present invention, particularly the Mass Difference Predicate (step (ii)) and the Reflective Predicate (step (iii)), readily reduce the volume of data and the number of putative amino acid sequences which are generated.

Thus the use of MS^(n) spectra where n>2 can result in a tree-like data structure (i.e. in which normal recursive navigation can be effected), with a spectrum for analysis being generated at the tree trunk (n=2) and hybrid spectra being generated at each branch. This recursive iteration of the possible spectra and the generation of a “tree” type structure of spectra is shown at FIGS. 8-13.
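The hybrid spectrum construction and its recursive, tree-like traversal can be sketched as below. The class `SpectrumNode`, the functions `hybrid` and `spectra_to_analyse`, and all peak values are hypothetical and serve only to illustrate the procedure described above; spectra are represented simply as lists of m/z values.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SpectrumNode:
    peaks: List[float]
    precursor_mz: Optional[float] = None  # None for the MS2 trunk
    children: List["SpectrumNode"] = field(default_factory=list)


def hybrid(parent_peaks, child_peaks, precursor_mz):
    """Drop parent peaks lighter than the precursor, then add the child."""
    kept = [p for p in parent_peaks if p >= precursor_mz]
    return sorted(kept + list(child_peaks))


def spectra_to_analyse(node, chain=()):
    """Return the trunk spectrum plus one hybrid spectrum per descendant,
    each hybrid being folded back level by level towards the trunk."""
    chain = chain + (node,)
    combined = list(node.peaks)
    for parent, child in zip(reversed(chain[:-1]), reversed(chain[1:])):
        combined = hybrid(parent.peaks, combined, child.precursor_mz)
    out = [sorted(combined)]
    for sub in node.children:
        out.extend(spectra_to_analyse(sub, chain))
    return out


ms4 = SpectrumNode([60.0, 180.0], precursor_mz=250.0)
ms3_a = SpectrumNode([50.0, 120.0], precursor_mz=200.0)
ms3_b = SpectrumNode([80.0, 250.0], precursor_mz=300.0, children=[ms4])
ms3_c = SpectrumNode([90.0, 350.0], precursor_mz=400.0)
ms2 = SpectrumNode([100.0, 200.0, 300.0, 400.0],
                   children=[ms3_a, ms3_b, ms3_c])

for spectrum in spectra_to_analyse(ms2):
    print(spectrum)
```

With one MS² trunk, three MS³ branches and a single MS⁴ branch, the traversal yields five spectra, matching the count given in the text.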

The generation of such spectra and hybrid spectra can be expressed asfollows:

The above method for determining putative amino acid sequences for asample polypeptide can be considered to comprise a deductive process.However, the present invention also extends to the use in inductiveprocesses to determine putative amino acid sequences. In particular,supervised machine learning algorithms can be used in order to determineputative sequences from MS and (MS)^(n) data.

In addition or alternatively, inductive methods may be employed to identify Difference Sets, particularly ion series. For example, Difference Sets (e.g. ion series) may be determined by:

- (a) passing the remaining candidate m/z peak sets as an input to a computer executing program code for a supervised learning algorithm, said supervised learning algorithm being trained to identify Difference Sets; and
- (b) outputting identified Difference Sets in said remaining m/z peak sets from said computer.

Supervised learning algorithms useful in the present invention include k-NN (T. M. MITCHELL, “Machine Learning”, McGraw-Hill International Editions, 1997), C4.5 (J. R. QUINLAN, “C4.5: Programs for Machine Learning”, Morgan Kaufmann, 1993), CN2 (P. CLARK, T. NIBLETT, “The CN2 induction algorithm”, Machine Learning, 3(4):261-283, 1989; P. CLARK, R. BOSWELL, “Rule induction with CN2: some recent improvements”, in Proceedings of ECML'91, pp. 151-163, 1991; R. RAKOTOMALALA, D. ZIGHED, F. FESCHET, “Empirical evaluation of rule characterization in rule induction process”, in Proceedings of the Fourteenth European Meeting on Cybernetics and Systems Research, pp. 779-804, 1998), RBF (Radial Basis Function neural network), and OC1 (Murthy S K et al., “A System for Induction of Oblique Decision Trees”, Journal of Artificial Intelligence Research 2 (1994) 1-32).

To briefly review the above algorithms, the k-NN algorithm compares an unknown data point (from within a new data set) with its k nearest neighbours among previously classified data points. With this method, the k nearest neighbours to the unknown point are most likely to be from the point's proper population. With this algorithm it may be necessary to reduce the weight attached to some variables by suitable scaling, such that at one extreme variables may be removed completely if they do not contribute usefully to the discrimination. This can be determined and addressed experimentally.
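As a minimal sketch of k-NN classification with per-variable scaling, assuming Euclidean distance and majority voting, the following Python fragment may be considered. The two-element feature vectors, the class labels and the weights are hypothetical and do not represent features prescribed by the present method.

```python
import math
from collections import Counter


def knn_classify(query, training, k=3, weights=None):
    """Label `query` by majority vote among its k nearest training points.

    `training` is a list of (feature_vector, label) pairs. `weights` scales
    each variable; a weight of 0 removes that variable from the distance.
    """
    if weights is None:
        weights = [1.0] * len(query)

    def distance(a, b):
        return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

    nearest = sorted(training, key=lambda item: distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: the first variable separates the classes,
# the second variable carries no class information.
training = [
    ((1.0, 0.0), "b"), ((1.1, 5.0), "b"), ((0.9, 9.0), "b"),
    ((3.0, 0.1), "y"), ((3.1, 5.1), "y"), ((2.9, 9.1), "y"),
]

# Setting the second weight to 0 removes the uninformative variable.
label = knn_classify((1.2, 9.0), training, k=3, weights=[1.0, 0.0])
print(label)
```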

The C4.5 algorithm generates decision trees, effectively generating tests to partition data. The algorithm uses an entropy based measure in order to determine the quality of the tests available. However, this measure alone (the reduction in the level of uncertainty of the class achieved by a test) would favour tests with many outcomes, so C4.5 also uses a modified measure to ensure that it is not biased towards such tests. The advantage that this algorithm has over others is that it supports estimated error-based pruning, so ensuring that performance is not reduced due to over-fitting.
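The effect of the modified measure can be illustrated with a short Python sketch that computes the entropy-based gain of a test together with the gain ratio (the gain divided by the entropy of the test's own outcomes). The labels and attribute values below are hypothetical; the sketch is not a full C4.5 implementation and omits tree building and pruning.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def gain_and_ratio(labels, attribute_values):
    """Return (information gain, gain ratio) for splitting on one attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(attribute_values)  # penalises tests with many outcomes
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


labels = ["b", "b", "y", "y"]                            # hypothetical classes
two_outcomes = gain_and_ratio(labels, ["lo", "lo", "hi", "hi"])
four_outcomes = gain_and_ratio(labels, [1, 2, 3, 4])     # one outcome per example
print(two_outcomes, four_outcomes)
```

Both tests remove all class uncertainty, so their gains are equal, but the test with one outcome per example receives the lower gain ratio.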

The CN2 algorithm has a key advantage over similar methods in its class, namely that it has an ability to cope with ‘other complications’ in the data. During its search for complexes CN2 does not automatically remove from consideration candidates which are included in more than one negative example. It reassigns a set of complexes in its search which is evaluated statistically as covering a large number of examples of a given class and few of other classes. The manner in which CN2 conducts a search is general-to-specific. Each trial specialisation step takes the form of either adding a new conjunctive term or removing a disjunctive term. Once CN2 has found a good complex, the algorithm removes those examples it covers from the training set and adds “if <complex> then predict <class>” to the end of the rule list. The process terminates for each given class when no more acceptable complexes can be added to the list.

The RBF algorithm is based on neural network technology, in which a network of units, known as nodes, is generated that emulate the operation of the human synaptic nerve junction. The RBF network consists of a layer of nodes, which perform linear or non-linear functions of the attributes, followed by a layer of weighted connections to the nodes whose outputs have the same form as the target vectors. It has a structure similar to the Multilayer Perceptron (MLP) except that each node of the hidden layer computes an arbitrary function of the inputs, and the transfer function of each output node is the trivial identity function. The hidden layer has parameters appropriate for whatever functions are being used, such as Gaussian widths and positions.

The main advantages that the RBF algorithm has over other neural net algorithms are that it includes a linear training rule once the locations in attribute space of the non-linear functions have been determined, and that its underlying model includes localised functions in the attribute space, rather than the long-range functions occurring in other models. The linear learning rule avoids problems associated with local minima, and in particular this provides an enhanced ability to make statements about the probabilistic interpretation of the outputs.

OC1 is a machine learning decision tree algorithm which, unlike C4.5 (which makes its decisions on boundaries based upon single attributes), can make its decisions on boundaries based upon combinations of attributes (termed oblique decisions). OC1 uses linear combinations of attributes in decision making and requires all attributes to be numeric.

Each supervised learning algorithm, when running as part of a computer program, needs to be provided with a training data set in order that it can learn to interpret new sets of data with an acceptable probability that it will return the correct results. This process (of learning from a training data set and then using it to make predictions and/or classifications of another data set) is referred to as “generalisation”. The process of generalisation allows the partitioning of data into a series of pre-defined classes (such as b- or y-Difference Sets).

Since the supervised learning algorithm is to determine the identity of Difference Sets, the training data needs to comprise Difference Sets, e.g. a-, b-, and y-ion series. Training data can also comprise negative examples of m/z peak sets, as well as data for other Difference Sets such as x- and z-ion sets. Similarly, training data can be provided for e.g. w-Difference Sets. Training data representing neutral loss sets can also be used.

Also provided according to the present invention is a method for using a computer to determine at least one putative amino acid sequence for a sample polypeptide, said sample polypeptide being partially degraded, said method comprising the steps of:

- (i) obtaining a soft ionisation mass spectrum of said sample polypeptide giving a set of m/z peaks of ion species obtained from said partially degraded sample polypeptide;

and with said computer:

- (ii) with said set of m/z peaks obtained in step (i), determining a plurality of candidate m/z peak sets, each m/z peak in each candidate m/z peak set differing from its at least one neighbour by the mass of an amino acid;
- (iii) with each candidate m/z peak set determined at step (ii), determining a sequence of mass differences between each m/z peak and its at least one neighbour and discarding any candidate m/z peak sets whose mass difference sequence in reverse order does not form at least part of the mass difference sequence of another of said candidate m/z peak sets;
- (iv) identifying any Difference Sets in the remaining m/z peak sets;
- (v) with the remaining candidate m/z peak sets, identifying and discarding any candidate m/z peak set which is a contiguous subsequence of another candidate m/z peak set; and
- (vi) determining a putative amino acid sequence for each remaining candidate m/z peak set, each amino acid sequence being that of the amino acids which correspond to the mass differences between each m/z peak and its at least one neighbour, each putative amino acid sequence comprising an at least partial putative amino acid sequence of said sample polypeptide.

The mass spectrum can be simply supplied as a data set from a mass spectrometer located locally or remotely, and can for example be a mass spectrum stored on a computer database or another storage medium.

The computer may return its results in any desired form, e.g. as at least one putative sequence, optionally with any value score or scores as discussed above, or e.g. as statistical data regarding the at least one putative amino acid sequence.

Also provided is a system for determining at least one putative amino acid sequence for a sample polypeptide, said sample polypeptide being partially degraded, said system comprising:

- (a) a memory for storing machine instructions for:
    - (i) with a set of m/z peaks obtained from a soft ionisation mass spectrum of said sample polypeptide giving said set of m/z peaks of ion species obtained from said partially degraded sample polypeptide, determining a plurality of candidate m/z peak sets, each m/z peak in each candidate m/z peak set differing from its at least one neighbour by the mass of an amino acid;
    - (ii) with each candidate m/z peak set determined at step (i), determining a sequence of mass differences between each m/z peak and its at least one neighbour and discarding any candidate m/z peak sets whose mass difference sequence in reverse order does not form at least part of the mass difference sequence of another of said candidate m/z peak sets;
    - (iii) identifying any Difference Sets in the remaining m/z peak sets;
    - (iv) with the remaining candidate m/z peak sets, identifying and discarding any candidate m/z peak set which is a contiguous subsequence of another candidate m/z peak set; and
    - (v) determining a putative amino acid sequence for each remaining candidate m/z peak set, each amino acid sequence being that of the amino acids which correspond to the mass differences between each m/z peak and its at least one neighbour, each putative amino acid sequence comprising an at least partial putative amino acid sequence of said sample polypeptide; and
- (b) a processor that is coupled to said memory, said processor executing said machine instructions, causing said processor to determine at least one putative amino acid sequence for said sample polypeptide.

Also provided is a computer program for determining at least one putative amino acid sequence for a sample polypeptide, said sample polypeptide being partially degraded and a soft ionisation mass spectrum of said sample polypeptide having been obtained giving a set of m/z peaks of ion species obtained from said partially degraded sample polypeptide, said computer program comprising:

- (i) program code for determining a plurality of candidate peak sets, from said set of m/z peaks, each m/z peak in each candidate m/z peak set differing from its at least one neighbour by the mass of an amino acid;
- (ii) program code for determining from each candidate m/z peak set determined at step (i) a sequence of mass differences between each m/z peak and its at least one neighbour, and for discarding any candidate m/z peak sets whose mass difference sequence in reverse order does not form at least part of the mass difference sequence of another of said candidate m/z peak sets;
- (iii) program code for identifying any Difference Sets in the remaining m/z peak sets;
- (iv) program code for identifying and discarding from the remaining candidate m/z peak sets any candidate m/z peak set which is a contiguous subsequence of another candidate m/z peak set; and
- (v) program code for determining a putative amino acid sequence for each remaining candidate m/z peak set, each amino acid sequence being that of the amino acids which correspond to the mass differences between each m/z peak and its at least one neighbour, each putative amino acid sequence comprising an at least partial putative amino acid sequence of said sample polypeptide.

Also provided is a computer program product for determining at least one putative amino acid sequence for a sample polypeptide, comprising a computer usable medium having computer readable program code means according to the present invention embodied in said medium.

The computer program code for identifying Difference Sets in the remaining m/z peak sets can be written in any appropriate language, but the present inventors have found that Logic Programming languages such as Prolog are particularly useful and effective in the present invention.

The invention will be further apparent from the following description, with reference to the several figures of the accompanying drawings, which show, by way of example only, one form of determining a putative amino acid sequence from a set of m/z peaks.

Of the Figures:

FIG. 1 shows cleavage of an amino acid sequence and the generation of daughter ion species;

FIG. 2 shows an example set of m/z peaks from which a putative amino acid sequence is determined. The sequence of SEQ ID NO: 1 is deduced by classifying a compound spectrum (top) into ion species spectra (middle and bottom);

FIG. 3 shows a table of amino acid properties. Columns are as follows: A: 3 letter amino acid code; B: empirical formula; C: monoisotopic mass (H=1.00782504, C=12.0000000, N=14.0030740, O=15.9949146, S=31.9720710); D: average mass (H=1.0079, C=12.011, N=14.007, O=15.999, S=32.066); E: side chain [nominal]; F: structure; G: neutral loss (T. Madden, et al., Org. Mass Spectrom., 26, 443 (1991)) [nominal]; H: immonium ion(s) (K. Ambihapathy, et al., J. Mass Spectrom., 32, 209 (1997), immonium ions being measured by FABMS (Pos.)) [nominal]; I: immonium ion relative intensity (W=weak, S=strong, V=very strong); J: type (A=apolar, U=uncharged polar, C=charged polar); K: Bull & Breese values (H. B. Bull, K. Breese, Archives Biochem. Biophys., 161, 665-670 (1974)); L: isoelectric point; M: occurrence (see for example prowl.rockefeller.edu/aainfo/contents.htm);

FIG. 4 shows (A) the basic structure of an amino acid, (B) an example representation of the structure of glycine, (C) the basic structure of a y_(j) ion, and (D) the basic structure of a b_(j) ion;

FIG. 5 shows (A) the basic structure of an x_(j) ion, (B) the basic structure of an a_(j) ion, (C) the z-series ionisation mechanism, r_(j) ^(b) and r_(j) ^(a) being alternatives to one another, and (D) the c-series ionisation mechanism;

FIG. 6 shows a composite scoring system for adjusting the scores allocated to top values in a b-series and y-series and for a y₁ ion;

FIG. 7 shows a logic-based Karnaugh map used in determining b- and y-series classification. In Series 1 and Series 2, letters with a bar above them indicate not classified as the series;

FIG. 8 shows a complex (MS)^(n) tree structure showing 4 possibleputative series paths labelled 1-4;

FIG. 9 shows Path 1 putative series masses;

FIG. 10 shows Path 2 putative series masses;

FIG. 11 shows Path 3 putative series masses;

FIG. 12 shows Path 4 putative series masses;

FIG. 13 shows Path 5 putative series masses using only (MS)² data; and

FIG. 14 shows splicing of N- and C-terminal series.

EXAMPLES

The following example demonstrates how a putative amino acid sequence is determined from a set of m/z peaks. All amino acids (with the exception of proline) terminate in the group NH₂CHRCOOH, wherein R denotes a side chain. This basic structure can be represented diagrammatically as shown in FIG. 4A, and an example representation of the structure of glycine is given at FIG. 4B.

Starting with the set of m/z peaks shown in the mass spectrum of FIG. 2, the following equations are used. Note that they are different to those defined in Scheme 4 of the Papayannopoulos, IA (supra) paper. The basic structure of a y_(j) ion is given at FIG. 4C, and that for a b_(j) ion at FIG. 4D.

$y_{j} = [C\ term] + \sum\limits_{i = (n - j + 1)}^{n} aa_{i} + [H] + [H^{+}]$

$b_{j} = [N\ term] + \sum\limits_{i = 1}^{j} aa_{i} + [H^{+}]$

For unmodified amino acids, where [C-term]=[OH] and [N-term]=[H], these can be re-written to represent a- and x-daughter ions (FIG. 5A is an x_(j) ion and FIG. 5B is an a_(j) ion). The equations to represent the masses of these ions are:

a _(j)=[N term]+Σ_(i=1) ^(j) aa _(i)−[CO]

x _(j)=[C term]+Σ_(i=(n−j+1)) ^(n) aa _(i)+[CO]

FIG. 5C shows the z-series ionisation mechanism. ^(a)R and ^(b)R represent substituents at the beta carbon atom of the amino acid, which lose a hydrogen, which in turn becomes the protonating H⁺. FIG. 5D shows the c-series ionisation mechanism. Thus:

$c_{j} = [N\ term] + \sum\limits_{i = 1}^{j} aa_{i} + [NH_{2}] + [H^{+}]$

i.e. $c_{j} = [N\ term] + \sum\limits_{i = 1}^{j} aa_{i} + [NH_{3}^{+}]$

$z_{j} = [C\ term] + \sum\limits_{i = (n - j + 1)}^{n} aa_{i} - [NH] + [H^{+}] - [H]$

i.e. $z_{j} = [C\ term] + \sum\limits_{i = (n - j + 1)}^{n} aa_{i} - [NH]$

In the above equations, 1<=j<=n−1, and “aa_(i)” denotes the mass of the “i”th amino acid. [N-term], [C-term], [CO], [NH], [NH₂], [NH₃⁺], [H] and [H⁺] denote the masses of the groups enclosed therein, i.e. the mass of the group attached to the N-terminus of an amino acid (usually H=1), the group attached to the C-terminus of an amino acid (usually OH=17), CO, NH and NH₂.

The above expressions are for the calculation of the mass of singly protonated a-, b-, c-, x-, y- and z-peptide fragment ions. Subscript j indicates the jth fragment ion of a peptide comprising n amino acids. The numbering of N-terminal fragment ions begins at the N-terminus, and the numbering of C-terminal fragment ions begins at the C-terminus. Therefore the jth N-terminal fragment ion of a peptide has a mass comprised of the sum of the masses of the first j amino acids. In contrast, the jth C-terminal fragment ion of a peptide has a mass comprised of the sum of the masses of the last j amino acids.

From the above equations it can be deduced that:

a _(j) −a _(j−1) =aa _(j)

x _(j) −x _(j−1) =aa _(n−j+1)

b _(j) −b _(j−1) =aa _(j)

y _(j) −y _(j−1) =aa _(n−j+1)

c _(j) −c _(j−1) =aa _(j)

z _(j) −z _(j−1) =aa _(n−j+1)

Thus the differences in the series of m/z peaks generated from a mass spectrum of a sample polypeptide will follow the pattern of amino acid masses in the sequence of the sample polypeptide. These equations are referred to herein as the “Difference Equations”.
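The intermediate step behind the Difference Equations can be written out for one N-terminal and one C-terminal series; the other series follow in the same way. In each case the constant terminal and proton terms cancel, leaving a single amino acid mass:

```latex
\begin{aligned}
b_{j} - b_{j-1} &= \sum_{i=1}^{j} aa_{i} - \sum_{i=1}^{j-1} aa_{i} = aa_{j} \\
y_{j} - y_{j-1} &= \sum_{i=n-j+1}^{n} aa_{i} - \sum_{i=n-j+2}^{n} aa_{i} = aa_{n-j+1}
\end{aligned}
```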

In addition, the following Series Difference Equations can be derived:

b _(j) −a _(j)=[CO]=28

c _(j) −b _(j)=[NH₃]=17

y _(j) −x _(j)=[H₂]−[CO]=−26

y _(j) −z _(j)=[H₂]+[NH]=17
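As a worked check of the C-terminal ion equations and of the y/x and y/z Series Difference Equations, the following Python sketch computes x_(j), y_(j) and z_(j) using nominal masses. The residue mass table holds standard nominal residue masses. The tetrapeptide EADK is a hypothetical input, chosen only because its y₁ to y₃ values coincide with the 147, 262 and 333 peaks of the example peak list. [C term]=[OH] is assumed.

```python
# Nominal residue masses of the standard amino acids (single-letter codes).
RESIDUE = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}
OH, H, H_PLUS, CO, NH = 17, 1, 1, 28, 15  # nominal group masses


def c_terminal_ions(sequence):
    """Return {j: (x_j, y_j, z_j)} for 1 <= j <= n-1, assuming [C term] = [OH]."""
    masses = [RESIDUE[aa] for aa in sequence]
    n = len(masses)
    ions = {}
    for j in range(1, n):
        tail = sum(masses[n - j:])          # aa_(n-j+1) .. aa_n, the last j residues
        x = OH + tail + CO
        y = OH + tail + H + H_PLUS
        z = OH + tail - NH
        ions[j] = (x, y, z)
    return ions


ions = c_terminal_ions("EADK")  # hypothetical peptide
for j, (x, y, z) in ions.items():
    print(j, "x:", x, "y:", y, "z:", z, "y-x:", y - x, "y-z:", y - z)
```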

Precursor Ion Relationships

Relationships with the precursor ion mass can also be determined as follows:

Adding the b_(j) and y_(j) equations and assuming that [N term]=[H] and that [C term]=[OH] the following results:

${b_{j} + y_{n - j}} = {{\sum\limits_{i = 1}^{n}\; {aa}_{i}} + \lbrack{OH}\rbrack + \lbrack H\rbrack + \left\lbrack H^{+} \right\rbrack + \lbrack H\rbrack}$

The mass of the protonated precursor ion can be calculated as follows:

$\left\lbrack {{precursor}\mspace{14mu} {ion}} \right\rbrack = {{\sum\limits_{i = 1}^{n}\; {aa}_{i}} + \lbrack{OH}\rbrack + \lbrack H\rbrack + \left\lbrack H^{+} \right\rbrack}$

in which the [H⁺] arises as the precursor ion is protonated.

Thus,

b _(j) +y _(n−j)=[precursor ion]+[H]
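This relationship can be used to look for complementary b/y pairs in a peak list, as in the following Python sketch. The peak values are those quoted for the worked example in this description. The constant 1319 is an assumption: it is inferred from the peaks labelled b2 (243) and y10 (1076) in the worked example, since the precursor ion mass itself is not stated.

```python
PEAKS = [147, 185, 197, 213, 215, 225, 243, 262, 296, 324, 333, 342, 409, 437,
         455, 462, 489, 506, 524, 538, 542, 577, 659, 664, 777, 846, 864, 959,
         977, 1076]


def complementary_pairs(peaks, total):
    """Return sorted (low, high) peak pairs whose masses sum to `total`."""
    present = set(peaks)
    return sorted(
        (p, total - p) for p in present if p < total - p and (total - p) in present
    )


# Assumed value of [precursor ion] + [H], inferred from b2 + y10 = 243 + 1076.
ASSUMED_TOTAL = 1319
pairs = complementary_pairs(PEAKS, ASSUMED_TOTAL)
print(pairs)
```

Under that assumption, each pair found joins a peak labelled as a b-ion with a peak labelled as a y-ion in the worked example.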

Electrospray Doubly Charged Ion Relationships

Relationships with electrospray doubly charged ions can be determined asfollows, electrospray samples usually containing an intense doublycharged peak.

$\left\lbrack {{precursor}\mspace{14mu} {ion}} \right\rbrack = {\left( {{\sum\limits_{i = 1}^{n}\; {aa}_{i}} + \lbrack{OH}\rbrack + \lbrack H\rbrack + \left\lbrack H^{+} \right\rbrack + \left\lbrack H^{+} \right\rbrack} \right)/2}$

Thus,

b _(j) +y _(n−j)=2*[precursor ion]

Largest Series Ion and Precursor Relationships

Given that 1<=j<=n−1, the following can be derived:

$[precursor\ ion] - [b_{n - 1}] = \sum\limits_{i = 1}^{n} aa_{i} + [OH] + [H] + [H^{+}] - \left( \sum\limits_{i = 1}^{n - 1} aa_{i} + [H] \right)$

i.e.

[precursor ion]−[b _(n−1) ]=aa _(n)+[OH]+[H⁺]

and

[precursor ion]−[y _(n−1) ]=aa ₁+[OH]+[H]+[H⁺]−[OH]−[H₂]

i.e.

[precursor ion]−[y _(n−1) ]=aa ₁

The top spectrum in FIG. 2 shows 16 mass peaks which can in turn be used to determine a very large number of sets of masses, each containing between 2 and 16 of the mass peaks shown and which might correlate with the sequence of the sample polypeptide. The total number of sets of m/z peaks, denoted by M, is represented as:

$\sum\limits_{i = 1}^{16} {}^{16}P_{i}$
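To give a sense of scale, the sum of permutations above can be evaluated directly. The following Python sketch (the function name is hypothetical) uses the standard library function math.perm, available from Python 3.8:

```python
import math


def ordered_set_count(n_peaks):
    """Sum of nPi for i = 1..n: the number of ordered selections of the peaks."""
    return sum(math.perm(n_peaks, i) for i in range(1, n_peaks + 1))


# For 3 peaks: 3 + 6 + 6 = 15 ordered selections.
print(ordered_set_count(3))
# The 16-peak case of FIG. 2.
print(ordered_set_count(16))
```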

The deductive method works by determining a subset m_(DN) (DN is short for De-Novo) of M by using predicate calculus concepts. Predicate calculus is a mathematical method derived from boolean algebra and is well known in the art.

The following statement is true: ∀m_(DN)εM i.e. each m_(DN) set is a subset of M. (∀ meaning “for all”, and ε denoting a subset).

Thus the m_(DN) sets are deduced using the predicate that each member in the set must be one amino acid different in mass from its neighbouring at least one element (the “Mass Difference Predicate”). This satisfies the Difference Equations (above). As discussed above, the set of amino acid masses which are made available for the predicate can be those of the standard amino acids or can, if desired, include masses for unusual or modified amino acids, or can exclude certain amino acids. The decision as to which masses are to be included and which are to be excluded can be made on the basis of available information about the sample polypeptide, or for example the culture conditions of a micro-organism producing the sample polypeptide which mean that e.g. isotopically labelled versions of amino acids may be comprised in the sample polypeptide.

A mass spectrum of a sample polypeptide having the sequence of SEQ ID NO: 1 (REGGAIFE) results in the set of peaks having the values 147, 185, 197, 213, 215, 225, 243, 262, 296, 324, 333, 342, 409, 437, 455, 462, 489, 506, 524, 538, 542, 577, 659, 664, 777, 846, 864, 959, 977 and 1076. Candidate m/z peak sets are then derived as shown in Table 1.

To do this, a computer program searches for solutions to the Mass Difference Predicate. It is presented with the set of m/z peak values detailed in Table 1 and first selects a root (or “starting point”) mass. The highest mass is used; in the case of Table 1 this is mass 1076. It then looks for all masses that are exactly one amino acid mass lower in mass than the root mass. Each of these masses in turn is used to find further masses that are again separated by amino acid mass values, and this is repeated until there are no more possible solutions.

The process is then repeated with a new root mass, one lower in the series than before. In the case shown in Table 1, this is mass 977. Any new series found are added to the original series found from root mass 1076. The process is repeated right down to the penultimate mass 185.

The results of the process constitute the group of candidate m/z peak sets. The computational complexity of the task is very substantial since a vast number of possible solutions to the Mass Difference Predicate must be evaluated, making the task impossible (at least within any kind of a reasonable time frame) for a human, even for a simple set of m/z peaks as shown in Table 1. The present invention is capable of determining candidate m/z peak sets and putative amino acid sequences from much more complex sets of m/z values, and it should be noted that with each additional m/z value added to the set to be considered, the number of possible sets of m/z values increases in a substantial non-linear manner.

As can be seen from Table 1, a total of 19 basic candidate m/z peak sets are determined. In addition to these (and not shown for the sake of simplicity), contiguous subsequences of each of those shown are also derived and are candidate m/z peak sets which are evaluated using the method of the present invention. An example of a contiguous subsequence is the series y10, y9, y8-18 shown in column 1: the two subsets y10, y9 and y9, y8-18 are also generated and are analysed according to the method of the present invention. Similarly, the set of peaks at m/z values 409, 296 and 197 is also generated and is a subset of the set listed at column 11, but is not shown for the sake of convenience.

The “Series” column of Table 1 shows information about the m/z peaks which has been derived as a result of working the method of the present invention. The only data initially available is the m/z peak data, from which the candidate m/z peak sets are determined and which are subsequently analysed and filtered to derive the “Series” information.

Table 1 shows a number of series (i.e. candidate m/z peak sets) including the following 5: [y10,y9,y8,y7,y6,y5,y4,y3,y2,y1] (i.e. the m/z peak set 1076, 977, 864, 777, 664, 577, 462, 333, 262, and 147), [y9-18,y8-18] (i.e. the m/z peak set 959, and 846), [b5,b4,b3,b2] (i.e. the m/z peak set 542, 455, 342, and 243), [b5-18,b4-18,b3-18,b2-18] (i.e. the m/z peak set 524, 437, 324, and 225), and [a4-18,a3-18,a2-18] (i.e. the m/z peak set 409, 296, and 197).

The first letter denotes the ion series type (which, as mentioned above, is subsequently derived but is included here for the sake of convenience; the initial data consists only of m/z peak values) and the second number denotes a mass displacement from the species. For example the b-18 set denotes a b-series, all with a water molecule (mass 18 Daltons) displacement. Certain amino acids will have displaced peaks. Common Displacement Masses are 18 Da (H₂O), 28 Da (CO) and 17 Da (NH₃).

In order to deduce all possible members of the set m_(DN) a computer program may be employed as follows. The code detailed below is formatted in the Prolog language and searches for all solutions satisfying a set of logical conditions, or predicates, in this case the Mass Difference Predicate. The program language itself uses backtracking methods to find the set of solutions automatically. Thus the identification of the members of the set m_(DN) which fulfil the requirements of the Mass Difference Predicate can be accomplished with the following Prolog code:

    % Terminating clause: no masses remain, so the accumulated
    % solutions are returned as the result.
    amino_generator([], RESULT, RESULT).
    % Recursive clause: collect every series rooted at HROOTMASS,
    % add them to the solutions so far, then move to the next root.
    amino_generator([HROOTMASS|TAILMASSES], ACCUMSEQUENCE, RESULT) :-
        findall(X,
                amino_diff_set(HROOTMASS, [HROOTMASS|TAILMASSES], [], X),
                NEXTROOTSEQUENCE),
        concat(NEXTROOTSEQUENCE, ACCUMSEQUENCE, NEWSEQUENCE),
        amino_generator(TAILMASSES, NEWSEQUENCE, RESULT).

wherein:

- HROOTMASS=next root mass in sequence
- TAILMASSES=remaining masses
- ACCUMSEQUENCE=current set of solutions
- RESULT=an empty computation variable that receives the final result
- amino_diff_set=a Prolog goal
- X=general unknown variable
- NEXTROOTSEQUENCE=a newly found series
- NEWSEQUENCE=the new current list of solutions

The findall goal (predicate) is used to deduce all possible sequences starting from the root mass (HROOTMASS). The findall goal uses the amino_diff_set goal (predicate) which is able to generate all the sequences starting from a given root mass. All the series found are then added to the existing series using the concat goal. Finally the goal recursively calls itself using the TAILMASSES which are the remaining set of mass peaks after the current highest mass (HROOTMASS) has been removed. The next call to the amino_generator goal will start with a new HROOTMASS, which is one mass down the set of masses from the last time.

Although the above code is formatted for Prolog, a wide range of other languages such as C, C++, C# and PASCAL can also be used to embody the same process (e.g. emulate the same program).
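As an illustration of the same search in another language, the following Python sketch applies the Mass Difference Predicate to the peak list of the worked example. It is an approximation under stated assumptions, not a transcription of the Prolog program: nominal masses of the standard residues are used, exact integer matching stands in for any mass tolerance, and the function names are hypothetical. Only maximal chains are returned, so shorter contiguous subsequences would be derived from them separately.

```python
# Nominal residue masses of the standard amino acids (L/I and K/Q coincide).
RESIDUE_MASSES = {57, 71, 87, 97, 99, 101, 103, 113, 114, 115, 128, 129, 131,
                  137, 147, 156, 163, 186}

PEAKS = [147, 185, 197, 213, 215, 225, 243, 262, 296, 324, 333, 342, 409, 437,
         455, 462, 489, 506, 524, 538, 542, 577, 659, 664, 777, 846, 864, 959,
         977, 1076]


def chains_from(root, peaks):
    """All maximal descending chains from `root` in which successive peaks
    differ by exactly one residue mass."""
    nexts = [p for p in peaks if p < root and (root - p) in RESIDUE_MASSES]
    if not nexts:
        return [[root]]
    return [[root] + tail for p in nexts for tail in chains_from(p, peaks)]


def candidate_sets(peaks):
    """Use every peak in turn as the root mass, highest first."""
    ordered = sorted(peaks, reverse=True)
    found = []
    for root in ordered:
        for chain in chains_from(root, ordered):
            if len(chain) >= 2:          # a set needs at least two peaks
                found.append(chain)
    return found


candidates = candidate_sets(PEAKS)
print(len(candidates), "maximal candidate sets")
print(candidates[0])
```

Because every peak is tried as a root, part chains are found as well as full series, including false series such as the set containing the peak at 538.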

Although FIG. 1 demonstrates intact chains of amino acids, it is often the case that one of the ion species is missing and that only part chains exist. The present invention can determine part sequences, providing that they satisfy the Mass Difference Predicate. The ability of the invention to discover part chains comes from the fact that all masses in the original m/z peak set are used as possible starting points or root masses.

Thus the at least one putative amino acid sequence determined by the present invention can comprise at least two putative partial sequences for said sample polypeptide.

Table 1 shows the Mass Difference Predicate solutions for the set of m/z values shown. Four out of five of the correct series have been found, namely the m/z peak sets shown in columns 2, 6, 9 and 15. The set in column 11 is the a-18 series with a further peak at mass 538 added, the peak at 538 not corresponding to SEQ ID NO: 1.

However there are also 14 additional incorrect (or “false”) series in the list. Further steps are applied to eliminate false series, and these steps can be regarded as being filtering steps. The filtering steps are also used to classify ion series, deriving the information given in the “Series” column of Table 1.

Reflective Predicate

There are two filtering mechanisms. The first filtering mechanism uses a Reflective Predicate, which is applied to pairs of m_(DN). The sequences of amino acids go in reverse order between the a-, b- and c-ions and the x-, y- and z-ions. This property is demonstrated mathematically in the Difference Equations: the C-terminal ions (x-, y- and z-) have the aa_(n−j+1) term, whereas the N-terminal ions have the aa_(j) term. The reversal or reflected nature of the separation of masses within two m_(DN) sets is demonstrated visually in FIG. 2: the y mass ions set is separated by the sequence EGGAIFE (SEQ ID NO: 2) and the b mass ions set by the sequence FIAGGER (SEQ ID NO: 3), which is the partial reverse of the y set.

Tables 2 and 3 show y- and b-series for a simple peptide LYLKGER (SEQ ID NO: 4).

The ḃ_(j) and ẏ_(j) columns represent differences between adjacent masses. The differencing operation is referred to as differentiation and is the discrete data analogue of continuous data differentiation. In Table 3 the ordering has been reversed. The sequences have been obtained from the differentiated series. It can be seen that both series share the part sequence YLKGE (amino acids 2-6 of SEQ ID NO: 4).

The Reflective Predicate Filter works by only including those pairs of m_(DN) that have a match between differentiated masses in reverse order. This property is represented mathematically as:

R _(j)ṁ_(DN) is a contiguous subsequence of _(k)ṁ_(DN)

where _(j)ṁ_(DN) denotes a vector of mass differences, or first differential, of _(j)m_(DN). R denotes a reversing (anti-diagonal) matrix; for example, the 3×3 reversing matrix is:

$\begin{pmatrix}0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0\end{pmatrix}$

This “contiguous subsequence property” requires the vector on the left hand side to be wholly contained within the vector on the right hand side and in the same order. For example the vector [1,2,3] is a contiguous subsequence of [7,8,1,2,3,5] and not a contiguous subsequence of [1,2] or [1,2,5,3]. If the property above is satisfied both _(j)m_(DN) and _(k)m_(DN) are included in the filtered subset; the two series are known as reflective pairs.
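The Reflective Predicate can be sketched in Python as follows, assuming each candidate set is listed highest mass first and contains at least two peaks. The function names are hypothetical. The three example sets are taken from the worked example: the y-series, the b-series and the set containing the false peak at 538.

```python
def differences(peaks):
    """First differential of a descending peak list."""
    return [a - b for a, b in zip(peaks, peaks[1:])]


def is_contiguous_subsequence(short, long):
    """True if `short` occurs within `long`, unbroken and in the same order."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def reflective_pairs(candidate_sets):
    """Index pairs (j, k) where the reversed differences of set j form a
    contiguous subsequence of the differences of set k."""
    pairs = []
    for j, set_j in enumerate(candidate_sets):
        reversed_diffs = differences(set_j)[::-1]
        for k, set_k in enumerate(candidate_sets):
            if j != k and is_contiguous_subsequence(reversed_diffs, differences(set_k)):
                pairs.append((j, k))
    return pairs


Y_SERIES = [1076, 977, 864, 777, 664, 577, 462, 333, 262, 147]
B_SERIES = [542, 455, 342, 243]
FALSE_SET = [538, 409, 296, 197]   # a-18 series plus the false 538 peak

pairs = reflective_pairs([Y_SERIES, B_SERIES, FALSE_SET])
print(pairs)
```

In this example the b-series pairs with the y-series, while the set containing 538 pairs with neither. Its sub-series 409, 296, 197 does satisfy the predicate against the y-series, which is consistent with the filter removing the false peak.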

Such reflected series are denoted by m_(R). The normal subset conditionholds:

∀m _(R) εm _(DN) εM

i.e. each m_(R) set is a subset of an m_(DN) set, which in turn is asubset of M.

Table 4 shows the results of the Reflective Predicate Filter. The filter tends to fragment long series. Critically, it acts to eliminate false peaks, as is demonstrated by the elimination of the false 538 mass peak from the a-18 series 11. The series now shown at column 11 of Table 4 is a contiguous subsequence of the series shown at column 11 of Table 1, and which was identified by the method of the present invention but not previously shown. Of the 15 series, 7 are correct whole or fragments of genuine series corresponding to SEQ ID NO: 1.

Neutral Loss Predicate

Subsequent to the filtering achieved by the Reflective Predicate Filter, an additional filtering method is employed to identify Displacement Ions. In this case it is the Neutral Loss Predicate, which is a deductive method. In other cases, or in addition to deductive predicates, inductive methods such as Supervised Machine Learning algorithms can be used to filter and/or classify the candidate m/z peak sets. The Neutral Loss Predicate works in a similar fashion to the Mass Difference Predicate described above, but is based upon the neutral losses which can occur with certain ion species. The exact nature of the neutral losses is determined by the nature of the ions themselves, and the identification of a neutral loss species can therefore be used to determine information about the ion species. For example, certain amino acids will undergo certain neutral losses. Therefore, a given ion species having a neutral loss can be determined as containing a specific amino acid, or as comprising one of a set of amino acids. Since the neutral loss occurs from the terminus of an ion species, it is also possible to determine the location of a particular amino acid, or of one of a set of amino acids, responsible for the neutral loss.

Details of neutral losses which occur with amino acids are given in FIG. 3. Examples of neutral losses are 18 (H₂O), 17 (NH₃) and 34 (H₂S).

Once such neutral losses have been identified, the candidate m/z peak sets can be labelled as being neutral loss sets and, as appropriate, as having certain properties, i.e. putatively containing certain amino acids. In order to further simplify the candidate m/z peak sets, the neutral loss sets can be correlated with the candidate m/z peak sets containing the same ordered peaks, which can in turn be attributed any labels given to the mass peaks of one another, and the shortest candidate m/z peak set discarded.

As can be seen from Table 4, the candidate m/z peak sets remaining after the Reflective Predicate Filter contain members which have m/z values which are displaced from those in other sets by fixed amounts. For example, in column 6 of Table 4 the candidate m/z peak set has the members 959 and 846. They are displaced by 18 u from the 977 and 864 members of the candidate m/z peak set given at column 5 of Table 4, which has members with values 1076, 977, 864 and 777.

Thus this filtering mechanism works by selecting pairs of sets that have 2 or more peaks separated by known Displacement Masses. From the Difference Equations, it can be seen that: $b_j - a_j = 28$

As well as providing a filtering mechanism, the Mass Difference Predicates can also be used as a series classifier, as only a- and b-series are displaced by 28 Daltons. The classification into series type is discussed further below.

As well as a- and b-series, it is possible to have y- and x-series ions which are separated by 26 Daltons, although in practice x-series are rarely found. As is the case with the Reflective Predicate Filter, both series satisfying the Mass Difference Predicate are included in the filtered set. Such pairs of series are known as Displacement Pairs.

The Mass Difference Predicate also holds between b- and c-series; in this case the mass difference value is that of NH₃, 17 Daltons. Similarly, y- and z-series ions are displaced by 17 Daltons.
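As a hedged illustration of how Displacement Pairs might be detected, the sketch below tests two candidate peak sets against the Displacement Masses named in the text. The function names and the dictionary labels are hypothetical, nominal integer masses and exact matching are assumed, and the two peak sets are the column 5 and column 6 members quoted above from Table 4.

```python
# Displacement Masses named in the text (nominal values, in Daltons).
DISPLACEMENT_MASSES = {28: "b/a", 26: "x/y", 17: "NH3", 18: "H2O"}


def displaced_matches(upper_set, lower_set, displacement):
    """Pairs (upper, lower) of peaks with upper - lower equal to the displacement."""
    lower = set(lower_set)
    return [(m, m - displacement) for m in upper_set if m - displacement in lower]


def find_displacement(set_a, set_b, min_matches=2):
    """Return (displacement, matches) if the sets form a Displacement Pair, else None."""
    for displacement in DISPLACEMENT_MASSES:
        # Either set may be the heavier one, so try both orientations.
        for upper, lower in ((set_a, set_b), (set_b, set_a)):
            matches = displaced_matches(upper, lower, displacement)
            if len(matches) >= min_matches:
                return displacement, matches
    return None


column_5 = [1076, 977, 864, 777]
column_6 = [959, 846]
print(find_displacement(column_5, column_6))  # (18, [(977, 959), (864, 846)])
```

Requiring at least two matching peaks mirrors the "2 or more peaks" condition; a single coincidental match is not treated as a pair.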

All 28 Daltons mass difference series are represented as m_(DN-28) and again the subset condition holds:

$\forall\, m_{DN-28} \subseteq m_{DN} \subseteq M$

In some instances the Mass Difference Predicates are applied to the subset of reflectively filtered series, and in this case:

$\forall\, m_{DN-28} \subseteq m_{R} \subseteq m_{DN} \subseteq M$

With the remaining candidate m/z peak sets, an additional step is performed to remove those candidate m/z peak sets which are contiguous subsequences of others (i.e. those whose members do not just form a subset of the members of another set or sets, but whose members form a contiguous subsequence of another set or sets).
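The distinction between a mere subset and a contiguous subsequence can be made concrete with a short sketch. The function names are hypothetical; the first peak set is the column 5 set quoted from Table 4, while the two shorter sets are invented purely for illustration.

```python
def is_contiguous_subsequence(needle, haystack):
    """True if `needle` occurs inside `haystack` as one unbroken, ordered run."""
    n = len(needle)
    return 0 < n <= len(haystack) and any(
        haystack[i:i + n] == needle for i in range(len(haystack) - n + 1)
    )


def drop_contained_sets(peak_sets):
    """Discard every peak set that is a contiguous subsequence of a longer set."""
    kept = []
    for candidate in peak_sets:
        contained = any(
            len(candidate) < len(other) and is_contiguous_subsequence(candidate, other)
            for other in peak_sets
        )
        if not contained:
            kept.append(candidate)
    return kept


peak_sets = [
    [1076, 977, 864, 777],
    [977, 864],  # contiguous run of the first set: discarded
    [977, 777],  # subset of the first set, but not contiguous: kept
]
print(drop_contained_sets(peak_sets))  # [[1076, 977, 864, 777], [977, 777]]
```

In this sketch only strictly shorter sets are discarded, so two identical sets would both be kept; de-duplication is assumed to happen elsewhere.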

The amino acid sequences represented by the candidate m/z peak sets can be determined by a simple look-up of the mass differences between the m/z peak values against a set of amino acid masses. Thus a set of m/z peak values can be readily translated into an amino acid sequence. This can be done starting from the heaviest m/z peak value, from the lightest, or in any other order. However, the resultant sequence still needs to be assigned a directionality: the ion species from which the sequences are determined could be a-, b- or c-ions, which will give, going from the heaviest m/z peak value to the lightest, an amino acid sequence in the direction C to N. Alternatively, the ion species from which the sequences are determined could be x-, y- or z-ions, which will give, going from the heaviest m/z peak value to the lightest, an amino acid sequence in the direction N to C. Amino acid sequences are commonly represented in the N to C direction, and sequence derived from a-, b- and c-ion species must not be confused with sequence derived from x-, y- and z-ion species.
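The look-up described above can be sketched as follows. The table holds the standard nominal (integer) residue masses; the function and variable names are hypothetical, and a real implementation would presumably match accurate masses within a tolerance rather than exact integers.

```python
# Nominal (integer) residue masses of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}

# Invert the table; residues sharing a nominal mass stay ambiguous ("L/I", "Q/K").
MASS_TO_RESIDUE = {}
for residue, mass in RESIDUE_MASS.items():
    MASS_TO_RESIDUE.setdefault(mass, []).append(residue)


def peaks_to_sequence(peaks):
    """Read residues off a peak set, going from the heaviest peak to the lightest."""
    ordered = sorted(peaks, reverse=True)
    return [
        "/".join(MASS_TO_RESIDUE.get(high - low, ["?"]))
        for high, low in zip(ordered, ordered[1:])
    ]


b_series = [114, 277, 390, 518, 575, 704, 860]  # Table 2
c_to_n = peaks_to_sequence(b_series)
print(c_to_n)        # ['R', 'E', 'G', 'Q/K', 'L/I', 'Y']
print(c_to_n[::-1])  # the same residues in the N to C direction
```

Because these are b-ions, reading from the heaviest peak gives the C to N direction, so the list is reversed for the conventional N to C presentation. At nominal mass the look-up cannot separate L from I or Q from K, which is why Table 2's L and K appear here as ambiguous entries.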

Thus a step in the process of determining a putative amino acid sequence can be to determine the directionality of the amino acid sequences represented by the candidate m/z peak sets, and this can be done as discussed by Papayannopoulos, IA (supra). Essentially, since the precursor mass for the sample polypeptide has been determined, and the candidate m/z peak set values are known, the heaviest values in each of the sets can be compared with the precursor ion mass to identify any differences which correlate to the mass of an amino acid, or to the mass of an amino acid + 18 u. If a candidate m/z peak set is found whose greatest m/z peak value is equal to the sample polypeptide precursor mass minus the mass of an amino acid then the candidate m/z peak set is a y-series whose C terminal member is the amino acid corresponding to the mass difference. Alternatively, if a candidate m/z peak set is found whose greatest m/z peak value is equal to the sample polypeptide precursor mass minus 18 and minus the mass of an amino acid then the candidate m/z peak set is a b-series whose N-terminal member is the amino acid corresponding to the mass difference plus 18.

For example, in FIG. 2 two candidate m/z peak sets are shown corresponding to the sequences REGGAIFE (SEQ ID NO: 1) and EFIAGGER (SEQ ID NO: 16) but there is no indication of directionality; the above step allows a directionality to be assigned to candidate m/z peak sets.

b-y Series Classification (Classification Predicate)

In order to determine whether a candidate m/z peak set might be a b- or y-series, the following can be derived from the b_(j) and y_(j) equations:

$b_{n-1} = \sum_{i=1}^{n-1} aa_{i} + [\text{N-term}] = [\text{precursor ion} + H^{+}] - aa_{n} - 18$  [Equation 1]

$y_{n-1} = [\text{precursor} + H^{+}] - aa_{1}$  [Equation 2]

Thus the following is true:

$b_{n} = [\text{precursor ion} + H^{+}] - 18$  [Equation 3]

When the conditions set out in Equations 1-3 are applied to the highest mass, a number of different scenarios can result, namely:

1. The series is ambiguously classified as both y- and b-series;
2. The series is classified as a b-series and not as a y-series;
3. The series is classified as a y-series and not as a b-series; and
4. The series is not classified as either a b-series or y-series.
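The four scenarios can be illustrated with a small sketch that applies the top-mass conditions to the highest mass of a series. It assumes nominal integer masses and exact matching; `classify_top_mass` and `precursor_mh` (standing for [precursor ion + H⁺]) are hypothetical names, and the y-series test uses the difference of one residue mass between the precursor and the second-highest y ion, as Table 3 shows (878 − 765 = 113).

```python
# Nominal residue masses of the 20 standard amino acids (L/I and Q/K coincide).
RESIDUE_MASSES = {57, 71, 87, 97, 99, 101, 103, 113, 114, 115,
                  128, 129, 131, 137, 147, 156, 163, 186}


def classify_top_mass(top_mass, precursor_mh):
    """Classify a series from its highest mass as 'b', 'y', 'ambiguous' or 'unclassified'."""
    gap = precursor_mh - top_mass
    # b_n lies 18 below the precursor; b_(n-1) lies 18 plus one residue below it.
    is_b = gap == 18 or (gap - 18) in RESIDUE_MASSES
    # y_(n-1) lies exactly one residue below the precursor.
    is_y = gap in RESIDUE_MASSES
    if is_b and is_y:
        return "ambiguous"
    if is_b:
        return "b"
    if is_y:
        return "y"
    return "unclassified"


precursor_mh = 878  # LYLKGER, highest mass of Table 3
print(classify_top_mass(860, precursor_mh))  # b
print(classify_top_mass(765, precursor_mh))  # y
print(classify_top_mass(747, precursor_mh))  # ambiguous
print(classify_top_mass(500, precursor_mh))  # unclassified
```

The ambiguous case arises whenever one residue mass equals another plus 18: a gap of 131 fits both M as a y-series loss and L/I plus 18 as a b-series loss.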

Possible b- and y-series are found as pairs satisfying the reflective predicate. For these pairs, the classification of each series in the pair is attempted using the equations above. The 4 scenarios for each series in the pair give 16 possible outcomes, which can be represented by a matrix. The classification decision can be made by using a logic-based Karnaugh map as shown in FIG. 7.

With reference to FIG. 7, it can be seen that, for example, if Series1 is classified as a b-series and not as a y-series but Series2 is ambiguously classified as both a b- and y-series, then Series1 is classified as a b-series and Series2 as a y-series.

Of the 16 scenario conditions, 6 can result in a classification error. If a classification error results then Difference Series can be calculated for difference values of −28 (a-series), −45 (a-17 series) and −46 (a-18 series). The total number of a-series and their neutral loss displacements can be compared for Series1 and Series2, and whichever series has the most is classified as the b-series. This classification method relies on the assumption that a-series are common and x-series are rare.

Finding Neutral Loss and a-Series from b- and y-Series:

Further neutral loss series, and a-series together with their neutral loss series, can be calculated from the already classified b- and y-series. Displacement values of −17, −18, −28, −45 and −46 are used to respectively compute b-17, b-18, a-, a-17 and a-18 Displacement Series from the classified b-series. Similarly, displacement values of −17 and −18 are used to respectively compute y-17 and y-18 Displacement Series from the classified y-series. From the b- and y-series relationship with precursor mass, namely:

$b_{j} + y_{n-j} = [\text{precursor ion}] + 1$  [Equation 4]

equations can be derived to compute a-series from y-series and y-17 series from b-series as follows:

$a_{j} + y_{n-j} = [\text{precursor ion}] - 27$  [Equation 5]

$b_{j} + (y\text{-}17)_{n-j} = [\text{precursor ion}] - 16$  [Equation 6]

Similar equations involving y-series ion masses with b-17, b-18, a-17 and a-18 ion series masses and the precursor ion mass can be derived by replacing −27 in Equation 5 by respectively −16, −17, −44 and −45. A similar relationship involving the b-series ion masses with y-18 and the precursor ion mass can be derived by replacing −16 in Equation 6 with −17.
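Equations 4-6 can be checked numerically against Tables 2 and 3. The sketch below is illustrative only: the helper names and the `OPPOSITE_END_CONSTANT` table are assumptions introduced here, with the constants taken from the replacement values listed above, and the precursor ion mass of 878 is the highest y mass of Table 3.

```python
# Same-end displacement values listed in the text.
B_DISPLACEMENTS = {"b-17": -17, "b-18": -18, "a": -28, "a-17": -45, "a-18": -46}
Y_DISPLACEMENTS = {"y-17": -17, "y-18": -18}

# Constant c in  x_j + y_(n-j) = [precursor ion] + c  for each N-terminal series x.
OPPOSITE_END_CONSTANT = {"b": 1, "b-17": -16, "b-18": -17,
                         "a": -27, "a-17": -44, "a-18": -45}


def same_end_series(series, offset):
    """Displacement Series computed from a series at the same terminal end."""
    return [mass + offset for mass in series]


def opposite_end_series(series, precursor, constant):
    """Masses paired with `series` through  m + m' = precursor + constant."""
    return [precursor + constant - mass for mass in series]


precursor = 878
b_series = [114, 277, 390, 518, 575, 704]  # b1..b6 from Table 2
y_series = [765, 602, 489, 361, 304, 175]  # y6..y1 from Table 3

# Equation 4 recovers the y-series from the b-series.
assert opposite_end_series(b_series, precursor, OPPOSITE_END_CONSTANT["b"]) == y_series

# The a-series computed from either terminal end agrees (Equation 5).
a_from_b = same_end_series(b_series, B_DISPLACEMENTS["a"])
a_from_y = opposite_end_series(y_series, precursor, OPPOSITE_END_CONSTANT["a"])
print(a_from_b)  # [86, 249, 362, 490, 547, 676]
print(a_from_y)  # [86, 249, 362, 490, 547, 676]
```

Computing the same Displacement Series from both terminal ends is what later allows the two versions to be combined into one set.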

Using the two methods outlined above, Displacement Series can readily be determined either from a series from the same terminal end, using a simple displacement factor, or from a series from the opposite terminal end, using a simple displacement factor together with the mass of the precursor ion. Thus it is possible to have two versions of the same Displacement Series type, say a-18, calculated from either terminal end of the peptide, namely an a-18 series represented by the two sets:

a-18_(CTerm); and

a-18_(NTerm)

An attempt can then be made to combine the two sets, and if the combined set satisfies the Mass Difference Predicate then it can be used in place of each of the two sets. If it does not satisfy the Mass Difference Predicate then both series are separately reported.

Thus:

$\text{a-18}_{combined} = \text{a-18}_{CTerm} \cup \text{a-18}_{NTerm}$

$y_{n-1} = [\text{precursor} + H^{+}] - aa_{1}$  [Equation 7]

Given that the C-terminus is OH, the following is true:

$y_{1} = aa_{n} + [\text{C-term}] + [H] + [H^{+}] = aa_{n} + 19$  [Equation 8]

Scoring of m/z Peak Sets:

A score system for b- and y-series is calculated by summing the number of possible Displacement Masses for each series. All other series are given a score of 0. The possible Displacement Series and their Displacement Masses are shown in Table 7. The total number of Displacement Masses for each possible Displacement Series is divided by the number of masses in either the b- or the y-series (as appropriate), giving a value ≤ 1. The b-series have 5 possible Displacement Series, giving a maximum displacement score of 5. The y-series have 2 possible Displacement Series and have a maximum score of 2.
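One possible reading of the displacement score is sketched below: for each Displacement Series of Table 7, the fraction of main-series masses whose displaced partner is present among the observed peaks is computed, and the fractions are summed. This interpretation, the function name and the exact-match test are assumptions made for illustration; the peak values are the masses listed in Table 1.

```python
B_DISPLACEMENTS = {"b-17": -17, "b-18": -18, "a": -28, "a-17": -45, "a-18": -46}
Y_DISPLACEMENTS = {"y-17": -17, "y-18": -18}


def displacement_score(series, series_type, observed_peaks):
    """Sum, over the Displacement Series, of the fraction of masses with a displaced partner."""
    if series_type == "b":
        table = B_DISPLACEMENTS
    elif series_type == "y":
        table = Y_DISPLACEMENTS
    else:
        return 0.0  # all other series are given a score of 0
    if not series:
        return 0.0
    observed = set(observed_peaks)
    score = 0.0
    for offset in table.values():
        hits = sum(1 for mass in series if mass + offset in observed)
        score += hits / len(series)  # each term is at most 1
    return score


observed = [147, 185, 197, 213, 215, 225, 243, 262, 296, 324, 333, 342, 409, 437,
            455, 462, 489, 506, 524, 538, 542, 577, 659, 664, 777, 846, 864, 959,
            977, 1076]
b_series = [243, 342, 455, 542]
y_series = [147, 262, 333, 462, 577, 664, 777, 864, 977, 1076]

print(displacement_score(b_series, "b", observed))  # 2.0
print(displacement_score(y_series, "y", observed))  # 0.2
```

For the b-series the b-18 partners are all present (1.0), one a partner is present (0.25) and three a-18 partners are present (0.75), giving 2.0 out of a possible 5.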

For b-series, an additional scoring factor, based on classifying the highest mass in a series as either the highest mass or the second highest mass in a b-series, is calculated. For y-series, the highest mass is classified against only a y-series second highest mass, as the highest mass in a y-series is the same as that of the precursor ion mass. These top mass conditions are set out in Equations 1-3.

If the highest mass in a y-series does not meet this criterion then an attempt to classify the lowest mass in the series as a y₁ ion is made. The lowest mass in the y-series is classified as a y₁ ion if it meets the condition in Equation 10. If either of the criteria described is met for a b- or y-series then 1.0 is added to the displacement score, giving a composite score. The logic for the score adjustment is shown in flow chart form in FIG. 6. In the chart of FIG. 6, 10 is equivalent to “Yes” and 20 is “No”. Other parts of the chart are as follows:

- 30: b_(top) = [Precursor ion + H⁺] − 18
- 40: b_(top−1) = [Precursor ion + H⁺] − 18 − [amino mass]
- 50: b_(adjusted score) = Displacement Score
- 60: b_(adjusted score) = Displacement Score + 1
- 70: y_(top−1) = [Precursor + H⁺] − [amino mass]
- 80: y_(adjusted score) = b_(adjusted score) + 1
- 90: y_(bottom) = [amino mass] + 19
- 100: y_(adjusted score) = b_(adjusted score)

$y_{j} = [\text{C-term}] + \sum_{i=n-j+1}^{n} aa_{i} + [H] + [H^{+}]$  [Equation 9]

[C-term] is usually OH, and so y₁ can usually be derived as being:

$y_{1} = aa_{n} + 19$  [Equation 10]

Determining Candidate m/z Peak Sets from (MS)^(n) Data

As is discussed above, the present invention can be used with (MS)^(n) data where n>2. FIG. 8 shows the multiple paths that can be obtained for a complex (MS)^(n) tree data structure. All 4 paths shown share a common (MS)² spectrum, but each has a separate (MS)³ spectrum. FIGS. 9-12 show how each of the 4 possible mass paths is resolved. The masses shown as dotted lines are not used in the construction of a compound (MS)^(n) spectrum. Thus only a single precursor ion m/z peak is taken from an MS¹ spectrum. With subsequent (MS)^(n) spectra, all peaks having an m/z value greater than or equal to the m/z value of the precursor ion used in any subsequent (MS)^(n) spectrum are used. Thus if there are no further (MS)^(n) spectra obtained, all peaks in the final spectrum are used. Each series of masses from the path is used as an input into the putative series predicate. Each of the putative series is then used as an input to other predicates in order to determine b-y pairs etc. The amino acid sequences determined from each path may differ or may be identical; if the sequences are identical they are combined. Scoring can also be incorporated, and a composite score associated with the sequence. The composite score is calculated by summing the individual sequence scores determined as described above.
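The peak selection rule for one path through the tree can be sketched as below. The data layout (a list of `(peaks, selected_precursor)` tuples), the function name and the example m/z values are all hypothetical and chosen only to illustrate the rule stated above.

```python
def compound_spectrum(path):
    """Build the compound peak list for one path through an (MS)^n tree.

    `path` is a list of (peaks, selected_precursor) tuples, MS1 first.
    `selected_precursor` is the m/z chosen for the next fragmentation stage,
    or None for the final spectrum of the path.
    """
    compound = set()
    for level, (peaks, selected) in enumerate(path, start=1):
        if selected is None:
            compound.update(peaks)  # final spectrum: all peaks are used
        elif level == 1:
            compound.add(selected)  # MS1: only the precursor ion peak is taken
        else:
            # Intermediate spectrum: keep peaks at or above the next precursor.
            compound.update(p for p in peaks if p >= selected)
    return sorted(compound)


# Hypothetical three-level path: MS1 -> (MS)^2 of 878 -> (MS)^3 of 602.
path = [
    ([450, 878, 1200], 878),
    ([175, 304, 489, 602, 765], 602),
    ([175, 304, 361, 489], None),
]
print(compound_spectrum(path))  # [175, 304, 361, 489, 602, 765, 878]
```

Peaks below the selected precursor in an intermediate spectrum are left to the deeper spectrum, which resolves that mass range in more detail.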

The determination of Difference Sets can be done as described above, or supervised learning algorithms can be used. As an example, Table 5 shows a part of a training data set passed to a supervised learning algorithm and which has just six m/z peaks labelled M1-M6. Table 6 shows examples of classification given to the m/z peaks: the “series” column denotes the actual type of series which is represented by the m/z peaks, and the “classification” column denotes the classification assigned to the series by the supervised learning algorithm.

Combining (Splicing) and Scoring Sequences

FIG. 2 demonstrates how sequences are determined, by subtracting adjacent series masses, in an ideal situation. It is possible to have a number of different scenarios when combining sequences from N- and C-terminal series. The different scenarios are demonstrated in FIG. 14 along with the mechanism used to combine (splice) the sequences. In FIG. 14, each circle represents a single amino acid in the sequence, such as G or A. Cross-hatched circles represent the longer series, and the solid circles represent the shorter series. Circles with cross-hatching and solid colouring represent parts of the long and short series that are common, i.e. the overlap segment.

In FIG. 14, 110, 140, 170, 210 and 250 represent long sequences. 120, 150, 180, 220 and 260 represent short sequences. 130 represents a single sequence resulting from splicing, namely long sequence 110. 160 represents a single extended sequence resulting from the splicing of sequences 140 and 150. 190 represents a first long sequence (Sequence 1) resulting from the splicing of 170 and 180. 200 represents a second right-spliced sequence (Sequence 2) resulting from the splicing of 170 and 180. 230 represents a first long sequence (Sequence 1) resulting from the splicing of 210 and 220. 240 represents a second short sequence (Sequence 2) resulting from the splicing of 210 and 220. 270 represents a first long sequence (Sequence 1) resulting from the splicing of 250 and 260. 280 represents a second left-spliced sequence (Sequence 2) resulting from the splicing of 250 and 260. Each of the scenarios demonstrated (110-130, 140-160, 170-200, 210-240 and 250-280) has a region of overlap of 3 amino acids. The scenarios depicted as 110-130 and 140-160 are most common, where the long and short series are consistent with one another. For both of these cases a single series, which may be the longer series or an extended version of the longer series, is generated. The single sequence deduced from the b- and y-series in FIG. 2 is represented by the scenario depicted as 140-160: the 6 amino acids FIAGGE (amino acids 1-6 of SEQ ID NO: 3) are common.

The remaining scenarios (170-200, 210-240 and 250-280) show the mechanism used when the sequences are inconsistent with one another. For these scenarios, the longer of the two sequences is made one of the solution sequences. The other solution depends on whether an end section of the shorter sequence is an overlapping segment. The third scenario (170-200) has the right hand last 3 amino acids of the shorter sequence common with a middle part of the longer sequence. Sequence 2 is formed by taking the entire shorter sequence and adding the part of the longer sequence after the overlap segment.

In the fourth scenario (210-240), the short sequence has differing mnemonics either side of the common segment. No attempt is made to combine the sequences, and the longer and shorter sequences become the 2 solutions. The fifth scenario (250-280) is similar to the third, but this time the common segment occurs on the left hand side of the shorter series.

The score value attributed to a sequence is based on the score value of the series from which it is deduced. For the first and second scenarios (110-130 and 140-160), which combine sequences, the score of the resulting single series is calculated by summing the individual series scores. For all of the other scenarios, Sequence 1 is given the score of the longer series and Sequence 2 that of the shorter.
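The two consistent scenarios (the first and second) and their score rule can be sketched as follows. This is a simplified illustration under stated assumptions: sequences are plain strings already oriented N to C, the function names are hypothetical, and the partial splices of the third and fifth scenarios are deliberately not implemented, so every inconsistent pair is simply returned as two separate solutions. The example strings assume that the y-derived sequence EGGAIFE has been reversed to EFIAGGE.

```python
def splice(long_seq, short_seq, min_overlap=3):
    """Return one spliced sequence for a consistent pair, else both sequences."""
    if short_seq in long_seq:
        return [long_seq]  # first scenario: the longer sequence is the result
    longest = min(len(long_seq), len(short_seq)) - 1
    for k in range(longest, min_overlap - 1, -1):
        # Second scenario: the end of one sequence overlaps the start of the other.
        if long_seq.endswith(short_seq[:k]):
            return [long_seq + short_seq[k:]]
        if short_seq.endswith(long_seq[:k]):
            return [short_seq + long_seq[k:]]
    return [long_seq, short_seq]  # not combined in this sketch


def splice_scored(long_item, short_item):
    """Apply `splice` to (sequence, score) pairs and attribute scores."""
    (long_seq, long_score), (short_seq, short_score) = long_item, short_item
    spliced = splice(long_seq, short_seq)
    if len(spliced) == 1:
        return [(spliced[0], long_score + short_score)]  # scores are summed
    return [(long_seq, long_score), (short_seq, short_score)]


print(splice("EFIAGGE", "FIAGGER"))                        # ['EFIAGGER']
print(splice_scored(("EFIAGGE", 3.0), ("FIAGGER", 1.5)))   # [('EFIAGGER', 4.5)]
```

The first call reproduces the FIG. 2 case, where the six common residues FIAGGE allow the two series to be merged into EFIAGGER.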

With this information, at least one putative amino acid sequence has been determined for a sample polypeptide.

It will be appreciated that it is not intended to limit the invention to the above example only, many variations, such as might readily occur to one skilled in the art, being possible, without departing from the scope thereof as defined by the appended claims.

TABLE 1: candidate m/z peak sets deduced from mass spectrum

Series Mass 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
y1 147 y y y y y
LV-28 185
a2-18 197 a-18 a-18 a-18
Internal 213 N
a2 215
b2-18 225 b-18 b-18 b-18 b-18 b-18
b2 243 b b
y2 262 y y y y
a3-18 296 a-18 a-18 a-18 a-18
b3-18 324 b-18 b-18 b-18
y3 333 y y y
b3 342 b b b
a4-18 409 a-18 a-18 a-18 a-18 a-18 a-18 b
b4-18 437 b-18
b4 455 b b b b-18
y4 462 y y
- 489 N N N N
- 506 N N
b5-18 524 b-18 b-18
2y10 + 2 538 N N N N
b5 542 b b b
y5 577 y y y y
y6 664 y y y y y
Precursor 659
y7 777 y y y y y
y8-18 846 y-18 y-18
y8 864 y y y y
y9-18 959 y-18
y9 977 y y y y
y10 1076 y y y y y

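TABLE 2: b-series of LYLKGER (SEQ ID NO: 4)

| j | b_(j) | $\dot{b}_j$ | aa_(j) |
|---|-------|-------------|--------|
| 1 | 114 | - | - |
| 2 | 277 | 163 | Y |
| 3 | 390 | 113 | L |
| 4 | 518 | 128 | K |
| 5 | 575 | 57 | G |
| 6 | 704 | 129 | E |
| 7 | 860 | 156 | R |

TABLE 3: y-series of LYLKGER (SEQ ID NO: 4)

| j | y_(j) | $\dot{y}_j$ | aa_(j) |
|---|-------|-------------|--------|
| 7 | 878 | 113 | L |
| 6 | 765 | 163 | Y |
| 5 | 602 | 113 | L |
| 4 | 489 | 128 | K |
| 3 | 361 | 57 | G |
| 2 | 304 | 129 | E |
| 1 | 175 | - | - |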

TABLE 4: Remaining candidate m/z peak sets after application of the Reflective Predicate Filter

Series Mass 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
y1 147 y
LV-28 185
a2-18 197 a-18
Internal 213 N
a2 215
b2-18 225 b-18
b2 243 b
y2 262 y y y y
a3-18 296 a-18 a-18 a-18
b3-18 324
y3 333 y y
b3 342 b b
a4-18 409 a-18 a-18 a-18 a-18 a-18
b4-18 437
b4 455 b
y4 462 y y
- 489 N
- 506 N
b5-18 524
2y10 + 2 538 N N
b5 542 b
y5 577 y y
y6 664
Precursor 659
y7 777 y
y8-18 846 y-18 y-18
y8 864 y
y9-18 959 y-18
y9 977 y y
y10 1076 y y

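TABLE 5: part of a set of training data

| Sequence | M1 | M2 | M3 | M4 | M5 | M6 | Ser | SEQ |
|----------|----|----|----|----|----|----|-----|-----|
| YELQDPR | 175 | 272 | 387 | 515 | 628 | 757 | y | 5 |
| YELQDPR | 114 | 213 | 326 | 441 | 498 | 611 | b | 5 |
| IVLDGLY | 182 | 295 | 352 | 467 | 580 | 679 | y | 6 |
| IVLDGLY | 157 | 214 | 377 | 506 | 605 | 704 | b | 6 |
| RALPPLR | 175 | 288 | 385 | 482 | 595 | 666 | y | 7 |
| RALFPLR | 167 | 293 | 406 | 534 | 649 | 746 | b | 7 |
| RGYEVVR | 114 | 213 | 376 | 447 | 633 | 704 | b | 8 |

Column headed “Ser” denotes Series. Column headed “SEQ” denotes SEQ ID NO.

TABLE 6: example results of Difference Set classification by supervised learning algorithm, the method being applicable to Difference Sets.

| Sequence | M1 | M2 | M3 | M4 | M5 | M6 | Ser | Cls | SEQ |
|----------|----|----|----|----|----|----|-----|-----|-----|
| IWIDGVR | 114 | 300 | 413 | 528 | 585 | 684 | b | b | 9 |
| IWIDGVR | 175 | 274 | 331 | 446 | 559 | 745 | y | y | 9 |
| MEGNDLK | 132 | 261 | 318 | 432 | 547 | 660 | b | y | 10 |
| MEGNDLK | 147 | 260 | 375 | 489 | 546 | 675 | y | y | 10 |
| LGELDGR | 114 | 171 | 300 | 413 | 528 | 585 | b | b | 11 |
| LGELDGR | 175 | 232 | 347 | 460 | 589 | 646 | y | b | 11 |
| HYARGVR | 138 | 301 | 372 | 528 | 585 | 684 | b | b | 12 |
| HYARGVR | 175 | 274 | 331 | 487 | 558 | 721 | y | y | 12 |
| IEGITAR | 114 | 243 | 300 | 413 | 514 | 585 | b | y | 13 |
| IEGITAR | 175 | 246 | 347 | 460 | 517 | 646 | y | y | 13 |
| NWARGVR | 115 | 301 | 372 | 528 | 585 | 684 | b | b | 14 |
| NWARGVR | 175 | 274 | 331 | 487 | 558 | 744 | y | y | 14 |
| NADINRR | 115 | 186 | 301 | 414 | 528 | 684 | b | y | 15 |
| NADINRR | 175 | 331 | 445 | 558 | 673 | 744 | y | y | 15 |

Column headed “Ser” denotes Series. Column headed “SEQ” denotes SEQ ID NO. Column headed “Cls” denotes classification of series by supervised learning algorithm.

TABLE 7: scoring Displacement Series

| Main Series | Displacement Series | Displacement Mass |
|-------------|---------------------|-------------------|
| b | b-17 | −17 |
| b | b-18 | −18 |
| b | a | −28 |
| b | a-17 | −45 |
| b | a-18 | −46 |
| y | y-17 | −17 |
| y | y-18 | −18 |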

1-22. (canceled)
23. A method for determining at least one amino acid sequence for a sample polypeptide, the method being implemented in a computer comprising a memory storing processor readable instructions and a processor in communication with said memory, the method comprising: receiving, as input to the processor, a set of m/z peaks associated with a soft ionization mass spectrum obtained from said sample polypeptide, each of said m/z peaks having an associated m/z value; identifying, by the processor, a plurality of m/z peak sets, each of said plurality of m/z peak sets corresponding to a possible amino acid sequence associated with the received set of m/z peaks; processing, by the processor, said plurality of m/z peak sets using a reflective predicate filter to identify m/z peak sets; discarding any identified m/z peak sets from said plurality of m/z peak sets; and outputting said at least one amino acid sequence based upon said remaining plurality of m/z peak sets.
24. The method of claim 23, wherein processing said plurality of m/z peak sets using a reflective predicate filter to identify m/z peak sets comprises, for each m/z peak set: comparing the possible amino acid sequence associated with the m/z peak set with possible amino acid sequences associated with other ones of the plurality of m/z peak sets in reverse order.
25. The method of claim 23, wherein processing said plurality of m/z peak sets using a reflective predicate filter to identify m/z peak sets comprises: for each of said m/z peak sets: determining mass differences between each pair of neighbouring m/z peaks of the m/z peak set based upon the m/z values associated with each pair of m/z peaks; determining a sequence of mass differences based upon the determined mass differences; and determining a sequence of mass differences in reverse order based upon the determined mass differences; and identifying m/z peak sets whose sequence of mass differences does not form at least part of a sequence of mass differences in reverse order of another of said m/z peak sets.
26. The method of claim 23, wherein identifying a plurality of m/z peak sets, each of said plurality of m/z peak sets corresponding to a possible amino acid sequence associated with the received set of m/z peaks, comprises: determining a plurality of said received set of m/z peaks that each have an associated m/z value that differs from an m/z value associated with another of said plurality of said received set of m/z peaks by the mass of an amino acid.
27. The method of claim 23, further comprising: identifying any m/z peak set which is a contiguous subsequence of another m/z peak set; and discarding any identified m/z peak sets from the plurality of candidate m/z peak sets.
28. The method of claim 23, further comprising: identifying candidate m/z peak sets in which mass differences between each m/z peak and at least one peak in the candidate m/z peak set having a closest m/z value above and/or below the peak correspond to that of a daughter ion or an end ion cluster; and discarding any identified m/z peak sets from the plurality of candidate m/z peak sets.
29. The method of claim 23, wherein outputting said at least one amino acid sequence based upon said remaining plurality of m/z peak sets comprises: for each of said remaining plurality of m/z peak sets, determining an amino acid sequence based upon a difference between m/z values associated with peaks of the m/z peak set.
30. A computer program product for determining at least one amino acid sequence for a sample polypeptide, the computer program product comprising a non-transient computer usable medium having a computer readable program code, said computer readable program code comprising instructions arranged to cause a computer to: receive, as input to the processor, a set of m/z peaks associated with a soft ionization mass spectrum obtained from said sample polypeptide, each of said m/z peaks having an associated m/z value; identify, by the processor, a plurality of m/z peak sets, each of said plurality of m/z peak sets corresponding to a possible amino acid sequence associated with the received set of m/z peaks; process, by the processor, said plurality of m/z peak sets using a reflective predicate filter to identify m/z peak sets; discard any identified m/z peak sets from said plurality of m/z peak sets; and output said at least one amino acid sequence based upon said remaining plurality of m/z peak sets.
31. The computer program product of claim 30, wherein processing said plurality of m/z peak sets using a reflective predicate filter to identify m/z peak sets comprises, for each m/z peak set: comparing the possible amino acid sequence associated with the m/z peak set with possible amino acid sequences associated with other ones of the plurality of m/z peak sets in reverse order.
32. The computer program product of claim 30, wherein processing said plurality of m/z peak sets using a reflective predicate filter to identify m/z peak sets comprises: for each of said m/z peak sets: determining mass differences between each pair of neighbouring m/z peaks of the m/z peak set based upon the m/z values associated with each pair of m/z peaks; determining a sequence of mass differences based upon the determined mass differences; and determining a sequence of mass differences in reverse order based upon the determined mass differences; and identifying m/z peak sets whose sequence of mass differences does not form at least part of a sequence of mass differences in reverse order of another of said m/z peak sets.
33. The computer program product of claim 30, wherein identifying a plurality of m/z peak sets, each of said plurality of m/z peak sets corresponding to a possible amino acid sequence associated with the received set of m/z peaks, comprises: determining a plurality of said received set of m/z peaks that each have an associated m/z value that differs from an m/z value associated with another of said plurality of said received set of m/z peaks by the mass of an amino acid.
34. The computer program product of claim 30, wherein said computer readable program code further comprises instructions arranged to cause a computer to: identifying any m/z peak set which is a contiguous subsequence of another m/z peak set; and discarding any identified m/z peak sets from the plurality of candidate m/z peak sets.
35. The computer program product of claim 30, further comprising: identifying candidate m/z peak sets in which mass differences between each m/z peak and at least one peak in the candidate m/z peak set having a closest m/z value above and/or below the peak correspond to that of a daughter ion or an end ion cluster; and discarding any identified m/z peak sets from the plurality of candidate m/z peak sets.
36. The computer program product of claim 30, wherein outputting said at least one amino acid sequence based upon said remaining plurality of m/z peak sets comprises: for each of said remaining plurality of m/z peak sets, determining an amino acid sequence based upon a difference between m/z values associated with peaks of the m/z peak set.
37. A system for determining at least one amino acid sequence for a sample polypeptide, said system comprising: a memory storing processor readable instructions; and a processor arranged to read and execute instructions stored in said memory; wherein said processor readable instructions contain instructions to cause a computer to: receive, as input to the processor, a set of m/z peaks associated with a soft ionization mass spectrum obtained from said sample polypeptide, each of said m/z peaks having an associated m/z value; identify, by the processor, a plurality of m/z peak sets, each of said plurality of m/z peak sets corresponding to a possible amino acid sequence associated with the received set of m/z peaks; process, by the processor, said plurality of m/z peak sets using a reflective predicate filter to identify m/z peak sets; discard any identified m/z peak sets from said plurality of m/z peak sets; and output said at least one amino acid sequence based upon said remaining plurality of m/z peak sets.
38. The system of claim 37, wherein processing said plurality of m/z peak sets using a reflective predicate filter to identify m/z peak sets comprises, for each m/z peak set: comparing the possible amino acid sequence associated with the m/z peak set with possible amino acid sequences associated with other ones of the plurality of m/z peak sets in reverse order.
39. The system of claim 37, wherein processing said plurality of m/z peak sets using a reflective predicate filter to identify m/z peak sets comprises: for each of said m/z peak sets: determining mass differences between each pair of neighbouring m/z peaks of the m/z peak set based upon the m/z values associated with each pair of m/z peaks; determining a sequence of mass differences based upon the determined mass differences; and determining a sequence of mass differences in reverse order based upon the determined mass differences; and identifying m/z peak sets whose sequence of mass differences does not form at least part of a sequence of mass differences in reverse order of another of said m/z peak sets.
40. The system of claim 37, wherein identifying a plurality of m/z peak sets, each of said plurality of m/z peak sets corresponding to a possible amino acid sequence associated with the received set of m/z peaks, comprises: determining a plurality of said received set of m/z peaks that each have an associated m/z value that differs from an m/z value associated with another of said plurality of said received set of m/z peaks by the mass of an amino acid.
41. The system of claim 37, wherein said computer readable program code further comprises instructions arranged to cause a computer to: identifying any m/z peak set which is a contiguous subsequence of another m/z peak set; and discarding any identified m/z peak sets from the plurality of candidate m/z peak sets.
42. The system of claim 37, further comprising: identifying candidate m/z peak sets in which mass differences between each m/z peak and at least one peak in the candidate m/z peak set having a closest m/z value above and/or below the peak correspond to that of a daughter ion or an end ion cluster; and discarding any identified m/z peak sets from the plurality of candidate m/z peak sets.
43. The system of claim 37, wherein outputting said at least one amino acid sequence based upon said remaining plurality of m/z peak sets comprises: for each of said remaining plurality of m/z peak sets, determining an amino acid sequence based upon a difference between m/z values associated with peaks of the m/z peak set.